Ask HN: Why doesn't anyone create a search engine comparable to 2005 Google?
398 points by syedkarim 46 days ago | 478 comments
I seem to recall that Google consistently produced relevant results and strictly respected search operators in 2005 (?), unlike the modern Google. And back then, I think search results were the same for everyone, rather than being customized for each user.



Ha, yes, I've done that at https://gigablast.com/. The biggest problems now are the following: 1) It's too hard to spider the web. Gatekeeper companies like Cloudflare (owned in part by Google) and Cloudfront make it really difficult for upstart search engines to download web pages. 2) Hardware costs are too high. It's much more expensive now to build a large index (50B+ pages) to be competitive.

I believe my algorithms are decent, but the biggest problem for Gigablast now is the index size. You do a search on Gigablast and say, well, why didn't it get this result that Google got? And that's because the index isn't big enough, because I don't have the cash for the hardware. BTW, I've been working on this engine for over 20 years and have probably written 1-2M lines of code for it.


You can be whitelisted so Cloudflare doesn't slow you down (or block you): https://support.cloudflare.com/hc/en-us/articles/36003538743...


It's not quite that easy. Have you ever tried it? See my post below. Basically, yes, I've done it, but I had to go through a lot and was lucky enough to even get them to listen to me. I just happened to know the right person to get me through. So, super lucky there. Furthermore, they have an AI that takes you off the whitelist if it sees your bot 'misbehave', and what counts as misbehaving is anyone's guess. So if you have a certain kind of bug in your spider, you're going to get kicked off the list. So then what? You have to try to get on the whitelist again? They have Bing and Google on special short lists so those guys don't have to sweat all these hurdles. Lastly, their UI and documentation are heavily centered around Google and Bing, so upstart search engines aren't getting the same treatment.


Cloudflare is not the only gatekeeper, either. Keep that in mind. There are many others and, as an upstart search engine operator, it's quite overwhelming to have to deal with them all. Some of them have contempt for you when you approach them. I've had one gatekeeper actually list my bot as a bad actor in an example in their documentation. So, don't get me wrong, this is about gatekeepers in general, not just Cloudflare and Cloudfront.


Given that treatment, one could say sites fronted by Cloudflare are part of a closed web.


I dunno if y'all realise this but I'd pay for a search engine that black holes CloudFlare and any other sites that think bots shouldn't read their sites.


rip the internet if you do that =/


> You do a search on Gigablast and say, well, why didn't it get this result that Google got. And that's because the index isn't big enough

I wonder how much this is true, and how much (despite all our rhetoric to the contrary) it's because we have actually come to expect Google's modern proprietary page ranking, which counts not just inbound links but all sorts of other signals (freshness, relevance to our previous queries, etc.).

We dislike the additional signals when it feels like Google is trying to second-guess our intentions, but we probably don't notice how well they work when they give us the result we expect in the first three links.


>but we probably don't notice how well they work when they give us the result we expect in the first three links.

For me the perceived quality of Google search results has dropped massively since 2008, despite (and maybe even because of) all their new parameters.

When someone says this, someone else usually immediately replies that it is because of web spam and black hat SEO.

But black hat SEO doesn't explain why verbatim doesn't work for many of us.

Black hat SEO doesn't explain why double quotes don't work.

Black hat SEO doesn't explain why there are no personal blacklists, so all those who hate Pinterest could blacklist it.

Black hat SEO probably also doesn't explain why I cannot find a unique string in open source repos and instead get pages of, not exactly webspam, but answers to questions I didn't ask.


I think people also have an inflated recollection of how good Google actually was back in 2005.

Back then Google was only going up against indexes and link-rings, not 2021 Google/Bing/DDG/etc.


> I think people also have an inflated recollection of how good Google actually was back in 2005.

I've been pointing this out for at least close to a decade.

I know since I bothered to screenshot and blog about it in 2012.

I'll admit mistakes happened back then too, but they were more forgivable, like keyword stuffing on unrelated pages. Back then Google was on our side and removed those as fast as possible.

Today, however, the problem isn't that someone has stuffed the keyword into an unrelated page, but that Google themselves mix a whole lot of completely irrelevant pages into the results, probably because some metrics go up when they do that.

Thinking about it, it seems logical that for a search engine that practically speaking has a monopoly both on users and, as mattgb points out, to some degree also on indexing, serving the correct answer first is just dumb: if they can keep me going between their search results and tech blogs with their ads embedded one, two or five extra times, that means one, two or five times more ad impressions.

Note that I'm not necessarily suggesting a grand evil master plan here, only that end-to-end metrics will improve as long as there is no realistic competition.


> Thinking about it, it seems logical that for a search engine that practically speaking has a monopoly both on users and, as mattgb points out, to some degree also on indexing, serving the correct answer first is just dumb: if they can keep me going between their search results and tech blogs with their ads embedded one, two or five extra times, that means one, two or five times more ad impressions.

This would mean that Google was measuring the quality of its search results by the number of ad impressions, which seems unlikely to me. Maybe in some big, woolly sense this is sort of true, but it seems pretty unlikely that anyone interested in search quality (i.e. the search team at Google) is looking at ad impressions.


I was using Altavista at that time, every now and then switching to Northern Light. Everything else was abysmal. Google blew them out of the water in terms of speed, quality, simplicity, lack of clutter and everything else. I can't remember ever retraining muscle memory as fast as when switching to Google. So, no, Google was great then and, apart from people actively working against the algorithm, is still good now, but obviously a completely different beast.


I think the parent's point was that people say Google 2005 >> Google 2021, but it's pretty hard to make this comparison in an objective way. No doubt Google 2005 was way better than other offerings around at the time.


2005? There were loads of other search engines (SE), and many meta-SE: hotbot, dogpile, metacrawler, ... (IIRC), plenty more.

There were also indexes, which Yahoo and AOL (remember them!) had, and there was, what was it called, dmoz?, the open web directory. When Google started, being in the right web directory gave you a boost in SERPs, as it was used as a domain trust indicator, and the categories were used for keywords. Of course it got gamed hard.

Google was good, but I used it as an alt for maybe 6 months before it won over my main SE at the time. I've tried but can't remember what SE that was, Omni-something??

One of the main things Google had was all the extra operators like link:, inurl:, etc., but they had Boolean logic operators too at one point, I think.


> I've tried but can't remember what SE that was, Omni-something??

Google replaced Altavista in my usage, who in turn were usually better than their predecessors.


I used them all and kept using the ones that gave me unique results. Google was hands down better because of pagerank and boosts to dmoz listed sites and because they scanned the whole page ignoring keywords.


Google was good, actually very good, back in the 2000s. Their PageRank algorithm practically eliminated spam pages that were simply a list of keywords. Before Google, those pages came up on the first page of Altavista.

I don't specifically remember 2005, but the quality went down with more modern but still shady SEO practices.


No, quality went down because google shat the bed. All the changes have been deliberate.


I hate google now. Every time I use it by accident I’m reminded how infuriating it is. I know DuckDuckGo is just bing in a Halloween mask, but I’ll gladly use something that’s not awesome as long as it’s also not infuriating. I’d take 2005 google any day.


Well if the result didn't appear in the first 5-10 pages, it's probably not in the index.

You can see it with other search engines. I challenge you to come up with a Google query for which a first-page result won't be seen within the first 10 pages of Bing results for the same query.

(Bonus points if that result is relevant).

There's only so much tweaking that personalization and other heuristics can do.

But if something is missing from the index, that's it.


I would like to see the least relevant search result Google comes up with. :)

Yes, I realize this is probably trivial with an API call, but I always found it interesting there isn't a way to see what the site with the lowest pagerank in the index is.


It sounds to me like your challenge includes anything which is in Google's index but not Bing's? Is that intentional?


I assume the author has the ability to search the index to see if your preferred Google result is even indexed.


I've used Gigablast off and on for a long time (I think I first discovered Gigablast in 2006 or so). Would be cool to have a registration service for legitimate spiders. I used to run a team that scraped jobs and delivered them (by fax, email, or US mail, as required by law) to local veterans' employment staffers for compliance. We were contracted by huge companies (at one point about 700 of the Fortune 1000) to do so, and often our spiders would be blocked by the employer's IT department even though the HR team was paying us big bucks to do so.


Dude, I use your engine regularly, it is spectacular. The amount of work you put into this takes some dedication.

I was curious if you ever intend to implement OpenSearch API so that we could use it as default in browser or embed it in applications?

Also how can people contribute to help you maintain a larger index and/or keep the service going?


Nice.

I'd pay $5-10/mo for a search engine that didn't just funnel me into the revenue-extracting regions of the web like Google does.


A subscriber-supported search engine sounds cool to me. Any precedent?


Copernic ( https://copernic.com/ ) had Copernic Agent Professional, a for-pay desktop application that had really good search features, a while ago. Not sure if they discontinued it.


Wow blast from the past. I think I was using Copernic all the way back in 2003... Forgot all about them. Thanks!


As a general rule, nobody is willing to pay what they are worth to advertisers. Facebook makes $70/year/user in the US. You would pay $70 for an ad-free Facebook? Congratulations, you must be an above-average earner. Also: your value to advertisers just tripled. If you are willing to pay $210, it will immediately triple again.


Great point! So simple, but as someone who has never worked on this side of things I never thought about it.

How would legal limitations on data collection, like GDPR, influence the ratio? Not at all? Only to an insignificant degree? Or enough to actually influence business decisions?


You'll like https://neeva.com/


How do they pay for it?


From the FAQ:

> …Eventually, we plan to charge our members $4.95/month.


Kagi.com does this. In closed beta at the moment, but you can email and request access.


I've tested Kagi a bit. It nicely gave me exactly what I wanted, even in cases where names could have different meanings in different contexts (I tested with Kotlin).

The basic results are good, with some nice touches here and there, like a "blast from the past" section with older results (which is actually what I sometimes want) and another section where it widens the search up a bit (i.e. what Google does by default?).

Furthermore, you can apply predefined search "lenses" that focus your search, or even make your own, and you can boost or de-rank sites.

I had not expected this to happen so quickly, but I'm going to move from DDG to Kagi as my default search engine for at least a couple of days, because I am fed up with both Google's and DDG's inability to actually respect my queries.

If it continues to work as well as it does today, I'll happily pay $10 a month, and I might also buy six-month gift cards for close friends and family next Christmas.

Think about it: unlike with an ad-financed engine, incentives are extremely closely aligned here. The smartest thing Kagi can do is get me my results as fast as possible, to conserve server resources (and delight their customer).

For an ad-financed engine, and especially one that also serves ads on the search results pages themselves, the obvious thing to do is to keep me bouncing between tweaking my search query and various pages that almost answer my question, but not quite.

(That said, if one is going to stay mainstream I recommend DDG over Google, since 1. for me at least Google's results are just as bad, 2. with DDG it is at least extremely easy to check with Google as well to see if they have a better result, and 3. competition is good.)


Perhaps trolling the entire web is not useful today? I’d love a search engine where I can whitelist sites or take an existing whitelist from trusted curators.


Heh, I guess you mean "trawling" - trolling the entire web is something very different :)


Then again, if you look at today's search results, where everything above the fold belongs to Google, maybe we have been trolled indeed.


Depending on the intended metaphor, trolling could work too :) https://en.wikipedia.org/wiki/Trolling_(fishing)


What would trolling the entire web look like?


It would look like a modern search engine with innovative technology offerings like Accelerated Mobile Pages.


Wow, you’re right. Trolling the entire web would involve an organization that carries considerable authority whose decisions can impact every member of the web.

AMP is the perfect way to troll websites into making shitty versions of their content, for no real reason other than just because you feel like it. And then when you’re satisfied with your trolling you just abandon the standard.


reddit



Not in this context - "trolling" as described there would apply to targeted indexing of a specific site; while "trawling" would refer to a wide net that attempts to catch all the sites.


Well, no, it's not fine.

See e.g. the source you linked, which explains the difference.


Did you read to the end? Methinks not!


>Did you read to the end? Methinks not!

Methink harder.

>Troll for means to patrol or wander about an area in search of something. Trawl for means to search through or gather from a variety of sources.

We were talking about gathering information from a variety of sources to build a search engine index.


Trusted curators are a dangerous dependency.


Trusted consumers are better. The original PageRank algo was organic and bottom-up. But now it's the person, not the page. Businesses compete for interaction, not inbound links. So if you can make a modern PageRank that follows interaction instead of links and isn't a walled garden, then I'd invest.


I could make that work, but what do you mean by "walled garden" in this context?


the business and allies of google - those entrenched interests that limit the current visibility of the web to themselves


That's why you don't make it a hard dependency and instead let people curate their own list of tastemakers. They can share and exchange info about who the good tastemakers are, and good ones might even charge for access to exclusive flavors.


It is. The alternative is scooping everything and using algos to curate. That seems worse imo.


Perhaps vote on results like on Reddit posts? Gets the junk sites down (and out of the index eventually).


Any open voting system is going to be under serious SEO pressure.

That's the real issue: Google has indirectly infected the web with junk sites optimized for it. Any new search engine now has a huge hurdle to sort through all the junk, and if it succeeds the SEO industry is just going to target it too.

A more robust approach is simply to pay people to evaluate websites. Assuming it costs, say, $2 per domain to either whitelist or block, that's ~$300 million for the current web, and you need to repeat that effort over time. Of course, it's a clear cost vs. accuracy tradeoff. Delist sites that have copies and suddenly people will try to poison the well to delist competitors, etc.


Adding a gatekeeper collecting rent isn't a solution - the people using SEO are already spending money to get their name up high on the list.


This is money spent by a search engine not money collected from websites. People don’t ever want to be sent to a domain parking landing page for example.

More abstractly, SEO is inherently a problem for search engines. Algorithms have no inherent way to separate clusters of websites set up to fake relevance from actually relevant websites. Personally I would exclude Quora from all search results, but even getting to the point where you can make that kind of assessment is extremely difficult on the modern web. Essentially, the minimum threshold for usefulness has become quite high, which is a problem as Google continues to degenerate into uselessness.


Given Reddit is notorious for its problems with astroturfing and vote bots, I don't think this is a particularly promising approach.


Reddit is a community heavily gatekept by the mods with regard to specific topics.


Reddit is an extreme example of group think. Try posting something pro-Trump (I mean, surely even that guy has a positive thing or two to be said about him) and you'll get banned in some subs. Or you may get banned simply because the mod doesn't like the fact that you don't toe the party line.


Also, vote bots


That just means that you have to curate the people allowed to vote. Otherwise, it would be rule by the obsessed and the search engine optimizers, and the junk sites will dominate the index.

I'm not convinced that Google's recursive AI algos aren't a functional equivalent. They let you vote by tracking your clicks.


Plus, it scales less well than pure algorithmic search. This fight already happened, with a much smaller internet.


It works really, really well for libraries. Research libraries (and research librarians) are phenomenally valuable. I've missed them any time I'm not at a university.

Both curators and algorithms are valuable. This goes for finding books, for finding facts and figures, for finding clothes, for finding dishwashers, and for pretty much everything else.

I love the fact that I have search engines and online shopping, but that shouldn't displace libraries and brick-and-mortar. Curation and the ability to talk to a person are complementary to the algorithmic approach.


> It works really, really well for libraries

It scales extremely poorly. It works very well for situations where customers/sponsors are willing to spend lots of money for quality, because then the cost scaling doesn't matter as much; research libraries, LexisNexis, Westlaw, etc. all do this, but it's not cheap, and the cost scaling with the size of the corpus sucks compared to algorithmic search.

It is among the approaches to internet search that lost to more purely algorithmic search, because it scales poorly in cost.


+book stores. Curators can use algorithms to help them curate… Google’s SE is taking signals from poor curators imo.


How about just a meritocratic rating? Even here on HN I would appreciate some sort of weight on expert/experienced opinion. Although in theory I like the idea that every thought is judged on its own, the context of the author is more relevant the deeper the subject. That's one of the reasons I still read https://lobste.rs. It has a niche audience with industry experience.


Lobsters is a great example of the benefits _and dangers_ of expert/experienced opinion. Lobsters is highly oriented around programming languages and security and leaves out large swaths of what's out there in computing. That's fine of course, but it creates a pretty big distortion bubble that's largely driven by the opinions of the gatekeepers on the site rather than a more wide computing audience.


Nothing is meritocratic. I think the term came into our lexicons because of a sociologist satirizing society and writing about how awful a “true” meritocracy would be.


> meritocratic rating

That is literally PageRank.


PageRank was mostly based on inbound links. A popularity contest with some nuance is just that. Nothing is meritocratic, including any Google algo.


It's not merely a democratic vote, where the most links wins, but what the algorithm does is evaluate the links based on the popularity of the originating domain. In other words, meritocratic rating.

You can apply the algorithm to any graph, and what it does is find the most influential nodes.
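To make that concrete, here's a minimal power-iteration PageRank over a toy link graph (just a sketch; the damping factor and the graph are the usual textbook toy values, nothing Google-specific):

  def pagerank(links: dict[str, list[str]], d: float = 0.85, iters: int = 50) -> dict[str, float]:
      # Collect every node that appears as a source or a target.
      nodes = set(links) | {t for ts in links.values() for t in ts}
      rank = {n: 1.0 / len(nodes) for n in nodes}
      for _ in range(iters):
          new = {n: (1 - d) / len(nodes) for n in nodes}
          for src in nodes:
              targets = links.get(src, [])
              if not targets:              # dangling node: spread its rank evenly
                  for n in nodes:
                      new[n] += d * rank[src] / len(nodes)
              else:
                  for t in targets:
                      new[t] += d * rank[src] / len(targets)
          rank = new
      return rank

  # Toy graph: everyone links to c, and c links back to a.
  toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
  for page, score in sorted(pagerank(toy).items(), key=lambda kv: -kv[1]):
      print(page, round(score, 3))

The node with the most (and most important) inbound links floats to the top, which is the "meritocratic rating" being argued about here.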


Meritocracy isn’t a thing. The person who coined it was rightfully mocking society. The recent 2019 Meritocracy Trap book goes further into this.

Your explanation is not "meritocratic". The wealthy and powerful largely stay on top with the nuance you provided. The popular are able to make what they link to more popular, and thus more powerful. There is no meritocracy there.


I’m really interested in this as well. I use DDG and whenever I’m doing research I tend to add “.edu” because there are so many spam sites.


If the user requests a website, you could at least crawl on request, which would be an excuse to bypass the rules in robots.txt. It would be a loophole, let’s say.


Ha, nice to hear this idea. I'm planning to work on this as a side project; just started recently.


That's a great idea.


Interesting. I had some interest in building a search engine myself (for playing around, of course). I had read a blog post by Michael Nielsen [1] which sparked my interest. Do you have any written material about your architecture and stuff like that? Would love to read up.

[1]: https://michaelnielsen.org/ddi/how-to-crawl-a-quarter-billio...



Holy, that's a huge codebase. GitHub doesn't even show syntax highlighting for many of the .cpp files because they are so big.

I fiddled around and searched for some not-so-well-known sites in Germany, and the results were surprisingly good. But it looks really... aged.


Holy shit. Click on random .cpp file. Browser hangs. O_O


Thank you.


> Cloudflare (owned in part by Google)

Please elaborate. Is there a special relationship between Cloudflare and Google?



That is not the same as being owned by Google.


Especially since Cloudflare went public back in 2019, at which point any investors cashed out.

- Sincerely, a Google employee who has nothing to do with the investment branch of the company


> at which point any investors cashed out.

Well, actually that is also not true. At IPO, preferred stock converts to common, but the investors can keep their ownership; they can cash out, but they don't have to, or they can cash out only partially.

Investors can also keep board seats in many (or most?) cases.


In this example, I don't think it matters if Google Ventures kept their shares or not. So long as they are treated like any other stockholder, I don't see an issue. If they still maintain a board seat, then there might be an issue, but I don't see a problem with simply holding shares.


I don't know anything about this particular case, but it's very common for VCs to cash out at IPO or not long after. VCs identify good investments among early stage companies; they don't want to keep their money tied up in investments outside of their specialty.


Actually, being an investor in a company is the same as owning that company in part.


Where did you read that google/alphabet owns part of Cloudflare?


Assuming OP is referring to Google Ventures' participation in at least one of Cloudflare's rounds.

https://www.crunchbase.com/funding_round/cloudflare-series-d...


Have you ever looked at the Amazon file?

I'll see if I can track down the link but I remember somebody sharing a dump with me from Amazon that apparently was a recent scrape.

Edit: https://registry.opendata.aws/commoncrawl/


That's Common Crawl, they do the spidering of some billions of webpages but that's still a tiny percentage of the web versus Google or Bing.
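If you want to gauge coverage for specific URLs, the crawl is queryable through the public CDX index API. A rough sketch (the crawl ID below is just an example; swap in a current one from https://index.commoncrawl.org/):

  import requests

  def in_common_crawl(url: str, crawl_id: str = "CC-MAIN-2021-43") -> bool:
      # The CDX index returns one JSON record per line on a hit, or a 404 on a miss.
      resp = requests.get(
          f"https://index.commoncrawl.org/{crawl_id}-index",
          params={"url": url, "output": "json"},
          timeout=30,
      )
      return resp.status_code == 200 and bool(resp.text.strip())

  print(in_common_crawl("example.com/"))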


Common Crawl is being used to train the likes of GPT-3 and to mine image-text pairs for CLIP. I wonder how much useful content is missing; we're going to use all the web text, images and video soon, and then what do we do? We run out of natural content. No more scaling laws.


Do you have any stats on that? I've always wondered about the coverage of Common Crawl, if you include all the historical crawl files too.


Oh interesting, I've played with it a little but not a dev and I've always wondered what the coverage was like.


If you're serious about this, add a paid tier. As long as it's free, I don't trust that you won't ever sell my data to make bank.


Why do people think a paid tier will prevent their data from being sold after the company pockets the money? Aside from that, if they go bankrupt, the data isn't theirs to withhold anymore.


You are going to pay for a generalized web search when DDG/Google/Bing/etc are free?


Yes. I use Brave Search and I hope they add a paid tier, which I think they have confirmed they'll add at a later date.

If you don't pay, you are the product. Simple as that.



There are a lot of products you pay for where you are still the product.


Telegram, Signal, and Mozilla are counterexamples... Have a large charitably donated cash balance sitting in your account, and your organisational motivation is all different.


The Mozilla Foundation does not fund Firefox; that's in an arm's-length, wholly owned for-profit subsidiary, and Google is the main source of funding via the search deal.


> If you don't pay, you are the product.

If not enough people pay, there's no product.


If nobody pays, there's even less of it. Not sure what's your point.


I would - the problem with those services is that they prioritise the results that generate the most money for the search engine rather than give me the best results, and then they index my searches to track and advertise to me throughout the web.

A clear pricing transaction sounds much nicer to me. Should generate better results too.


What we need is a net neutrality doctrine on the server side. Bandwidth is hardly scarce outside of AWS's business model. Ban the crawler user-agent dominance by the big search engine players. "Good behaviour" should be enforced via rate limiting that equally applies to all crawlers, without exemption for certain big players.
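To make "rate limiting that equally applies to all crawlers" concrete, here's a tiny token-bucket sketch where every client, Googlebot or a one-person crawler, gets the same budget (the numbers are arbitrary placeholders, not a recommendation):

  import time
  from collections import defaultdict

  class EqualRateLimiter:
      """Token bucket with identical limits for every client."""

      def __init__(self, rate_per_sec: float = 1.0, burst: int = 5):
          self.rate, self.burst = rate_per_sec, burst
          self.tokens = defaultdict(lambda: float(burst))
          self.last = defaultdict(time.monotonic)

      def allow(self, client: str) -> bool:
          now = time.monotonic()
          # Refill tokens based on elapsed time, capped at the burst size.
          self.tokens[client] = min(
              self.burst, self.tokens[client] + (now - self.last[client]) * self.rate
          )
          self.last[client] = now
          if self.tokens[client] >= 1.0:
              self.tokens[client] -= 1.0
              return True
          return False

  limiter = EqualRateLimiter()
  print(limiter.allow("googlebot"), limiter.allow("upstart-search-bot"))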



I hadn't used Gigablast before, but a quick test had it finding some very old, obscure stuff as the top hit. Well done. However, the link on the front page to explain privacy.sh comes up with "Not Private" in Chrome. The root Cisco Umbrella CA cert isn't trusted. Oops.


I tried out four search words with your search engine, and I am not convinced that it is mainly the index size and not the algorithm that is to blame for bad search results. There are way too many high-ranking false positives. Here is what I tried:

  a) "Berlin": 

  1. The movie festival "Berlinale"
  2. The Wikipedia entry about Berlin
  3. Something about a venue "Little Berlin", but the link resolves to an online gaming site from Singapore
  4. "Visit Berlin", the official tourism site of Berlin
  5. The hash tag "#Berlin" on Twitter
  6. "1011 Now" a local news site for Lincoln, Nebraska
  7. "Freie Universität Berlin"
  8. Some random "Berlin" videos on Youtube
  9. The Berlin Declaration of the Open Access Initiative
  10. Some random "Berlin" entries on IMDb
  11. A "Berlin" Nightclub from Chicago
  12. Some random "Berlin" books on Amazon
  13. The town of Berlin, Maryland
  14. Some random "Berlin" entries on Facebook
  15. The BMW Berlin Marathon
  
  b) "philosophy"

  1. The Wikipedia entry about philosophy
  2. "Skin Care, Fragrances, and Bath & Body Gifts" from philosophy.com
  3. "Unconditional Love Shampoo, Bath & Shower Gel" from philosophy.com
  4. Definition of Philosophy at Dictionary.com
  5. The Stanford Encyclopedia of Philosophy
  6. PhilPapers, an index and bibliography of philosophy
  7. The University of Science and Philosophy, a rather insignificant institution that happens to use the domain philosophy.org
  8. "What Can I Do With This Major?" section about philosophy
  9. Pages on "philosophy" from "Psychology Today". I looked at the first and found it to be too short and eclectic to be useful.  
  10. The Department of philosophy of Tufts University
  
  c) "history"

  1. Some random pages from history.com
  2. "Watch Full Episodes of Your Favorite Shows" from history.com
  3. Some random pages from history.org
  4. "Battle of Bunker Hill begins" from history.com
  5. Some random "History" pages from bbc.co.uk
  6. Some random pages from historyplace.com
  7. The hash tag "#history" on Twitter
  8. The Missouri Historical Society (mohistory.com)
  9. Some random pages from History Channel
  10. Some random pages from the U.S. Census Bureau (www.census.gov/history/)
  
  d) "Caesar"

  1. The Wikipedia entry about Caesar
  2. Little Caesars Pizza 
  3. "CAESAR", a source for body measurement data. But the link is dead and resolves to SAE International, a professional association for engineering
  4. The Caesar Stiftung, a neuroethology institute
  5. Some random "Caesar" books on Amazon
  6. Hotels and Casinos of a Caesars group
  7. A very short bio of Julius Caesar on livius.org
  8. Texts on and from Caesar provided by a University of Chicago scholar
  9. (Extremely short) articles related to Caesar from britannica.com
  10. "Syria: Stories Behind Photos of Killed Detainees | Human Rights Watch". The photos were by an organization called the Caesar Files Group
  
So what I can see are some high-ranked false positives that somehow use the search term, but not in its basic meaning (a3, a11, b2, b3, d2, d3, d4, d6), or not even that (a6). Some results rank prominently although they are of minor importance for the (general) search term (a9, a13, b7, b8 -- perhaps a15 and d10). Then there are the links to the usual suspects such as Wikipedia, Twitter, Amazon, etc. (a2, a5, a8, a10, a12, a14, b7, c5, d1, d5); I understand that Wikipedia articles feature prominently, but for the others I would rather go directly to e.g. Amazon when I am interested in finding a book (or use a search term like "Caesar amazon" or "Caesar books"). Well, and then there are the search results that are not completely off, but either contain almost no information, at least compared to the corresponding Wikipedia article and its summary (b4, b9, d7, d9), or are too specific for the general search term (c1, c2, c3, c4, c6, c9, c10).

That leaves me with the following more or less high quality results (outside of the Wikipedia pages): a1, a4, a7, b5, b6, b10, and d8. The a15 and d10 results I could tolerate if there had been more high quality results in front of them; but as the fourth and second good result, respectively, they seem to me to be too prominent. Also, in the case of "Berlin", a4 should have been more prominent than a1, and a7 is somewhat arbitrary, because Humboldt University and the Technical University of Berlin are likewise important; what is completely missing is the official website of the city of Berlin (English version at www.berlin.de/en/).

All in all, I would say that your ranking algorithm lacks semantic context. It seems the prominence of an entry is mainly determined by either just being from the big players like Twitter, Youtube, Amazon, Facebook, etc. or by the search term appearing in the domain name or the path of the resource, regardless of the quality of the content.


I don't know about others, but when I think of the "good old google days" I'm _not_ expecting the results for your example queries to be any good.

In those days querying took some effort but the effort paid off. The results for "history" just couldn't matter less in this mindset. You search for "USA history" or "house commons history" or "lake whatever history" instead. If the results come up with unexpected things mixed in, you refine the query.

It was almost like a dialog. As a user, you brought in some context. The engine showed you its results, with a healthy mix of examples of everything it thought was in scope. Then you narrowed the scope by adding keywords (or forcing keywords out). Rinse and repeat. As a user, you were in command and the results reflected that.

The idea that the engine should "understand what you mean" is what took us to the current state. Now it feels like queries don't matter anymore. Google thinks it knows the semantics better than you, and steering it off its chosen path is sometimes obnoxiously hard.


> The idea that the engine should "understand what you mean" is what took us to the current state. Now it feels like queries don't matter anymore. Google thinks it knows the semantics better than you, and steering it off its chosen path is sometimes obnoxiously hard.

Bingo! If you cede control to Google, it _will_ do what it's optimized to do, and not what _you_ are looking for.


What it is optimized to do says nothing.

Optimizing for open text queries means dealing with a massive search space; the key thing is choosing a subspace in which to search, and that is the part engines have to refine. How that is done is a different story. Some people may agree to let their location, search history and visits to online stores be used for that, but some may not.


This is why, in the good old days, my favourite search engine was AltaVista. In its left margin it had keywords arranged like a directory tree that could be used to further refine the search. So my ideal search engine should do something like this if I type in a generic term: provide me with relevant information about the general topic and then help me to refine my search. The way Wikipedia provides a principal article and a structured disambiguation page is what I would prefer.

I admit, my evaluation of the search engine was just a simple test of how much I could get out of the results for some generic keywords in the first place. A more detailed evaluation should, of course, look deeper. It was more of a trial balloon to see if this search engine raises any hope that it could be better than Google with regard to my own (subjective) expectations of a decent result set.


I get what you mean, but part of the whole initial appeal of Google was that it gave much more relevant results initially than Altavista or the other options. That was why Google put in the audacious "I'm feeling lucky" button.


Yeah, but it's because of that same philosophy that Google Search is useless: it optimises for the first result.

There is no search engine that searches literally for what you asked and nothing else. Search is shit in 2021 because it tries to be too clever. I'm more clever than it, let me do the refining.


>"I'm feeling lucky" button

My brain got so used to ignoring it that I completely forgot it's a thing. I'm also unclear on what it does. On an empty request, it takes me to their Doodles page, and with text in the box, it takes me to my account history landing page.


It automatically redirects to the first search result.


Right, not sure why it wasn't working yesterday as opposed to now, I swear I wasn't doing it wrong (or how I could've).


This was the result of two things: MapReduce, and using links to rank the pages.

Using links to rank the pages is not really possible any longer because of SEO spam links.
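For anyone who hasn't seen it, the map/reduce half of that looked roughly like this: a map step emitting (target, source) pairs from crawled pages, and a reduce step aggregating inbound links per target. A toy single-machine sketch (real systems shard both steps across machines):

  from collections import defaultdict

  def map_outlinks(page_url, outlinks):
      # Map step: emit one (target, source) pair per outgoing link.
      for target in outlinks:
          yield target, page_url

  def reduce_inlinks(pairs):
      # Reduce step: count distinct linking pages per target.
      inlinks = defaultdict(set)
      for target, source in pairs:
          inlinks[target].add(source)
      return {t: len(srcs) for t, srcs in inlinks.items()}

  crawl = {
      "a.com": ["b.com", "c.com"],
      "b.com": ["c.com"],
      "spam.com": ["c.com", "c.com"],  # duplicate links from one page count once
  }
  pairs = (p for url, links in crawl.items() for p in map_outlinks(url, links))
  print(reduce_inlinks(pairs))  # {'b.com': 1, 'c.com': 3}

The spam problem is visible even in the toy: spam.com's vote counts the same as anyone else's.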


I think you have some great feedback here but for me it also highlights how subjective search results can be for individuals - for example, these false positives that you mention (b2, b3) appear as the top result on Google for me for that query.

It makes me think there must be some fairly large segment of the population that want that domain returned as a result for their query, no?


I would not deny that a large part of subjectivity is involved. This is why I used several markers of subjectivity in my evaluation ("what I can see", "that leaves me", "they seem to me", "I would say", etc.). And related to that: I also agree with other responses that a search often needs to be refined. So my four examples were in no way an exhaustive evaluation, but an explorative experiment, where I just used two proper names, one for a city and one for a historical person, and two general disciplines as search words, in order to see what happens and what is noteworthy (to me). So much for the subjective side.

But what can be said about ideal search results for these terms beyond subjectivity? I do not think that we can arrive at an objective search result, but we are nevertheless allowed to criticise search results with respect to their (hidden or obvious) agenda.

Let me give an example from the good old days: When I was searching for my surname on Google in the early 2000s, the search results contained a lot of university papers or personal websites (then called "homepages") from other people of that name. But suddenly, I can't remember when exactly this was, the search results contained almost exclusively companies with that surname in their company name. The shift was not gradual, as if it were representing a slow shift in the contents of the Internet itself, but abrupt. It was apparently due to an intentional modification of the ranking algorithm that put business far above anything else on the Internet.

My explanation for this is the following: the objective metric for Google search results is the stream of revenue they generate for Google. But not only for Google. The fundamental monetary incentive for Bing (and its derivative Ecosia) is more or less the same. And how different the impact of Duck Duck Go's somewhat different business model is, is open for debate.

If maximum revenue is the goal, the aim is to provide the best search results according to the business model (advertising, market research and whatever else) without driving the users away. But the best search results according to the business model are not necessarily the optimal search results for the typical user. And as long as all relevant competitors are following the same economic pressure of maximizing revenue, the basic situation and thus the quality of the search results for the user will not improve above a certain level. If we want this situation to change, we need competitors with a different, non-commercial agenda. Either from the public sector (an analogue of the excellent information services about physical books provided by libraries) or from non-profit organizations (an analogue of Wikipedia or OpenStreetMap).

To answer your question about b2 and b3: I checked with other search engines; besides Google they appear for me also on Bing (as #8, same product but on a different Web-site) and Duck Duck Go (as #10); Bing also has a reference to them in the right margin as a suggestion for a refined search (this time exactly b2 and b3). Although I do not think that the results from those search engines should be considered as a general benchmark for good search results for the reasons given above, we may speculate why they appear on the first page of search results. I would guess that it is a combination of gaming the search engines by using a generic term as a product and domain name to get free advertising, and search engine algorithms making this possible by generally ranking products and companies high in their search results.


Oh of course you can criticise the result, I more found it interesting that a billion dollar, optimized search experience thought your false positive was actually a top result. A huge variance in the subjectivity between your experience and their invested reasoning.

But while we're speculating on how that domain appears at the top of the list, let me hazard a guess...

Philosophy.com was registered in 1999 and according to waybackmachine, has been selling cosmetics on the site since 2000 (20+ years). The company sold in 2010 for ~$1B to a holding company with revenues of $10B+ today (Unfortunately I couldn't find how much it contributes to that revenue). According to Wikipedia, the Philosophy brand has been endorsed by celebrities, including "long-time endorser" Oprah Winfrey, possibly the biggest endorsement you could get for their industry/demographic.

I think it is a long established business, with strong revenues and there's more people online searching for cosmetic brands than for philosophers.

In the same way (admittedly in the extreme) when I'm researching deforestation and I query to see how things are going for the 'amazon', the top result is another successful company registered pre 2000, with strong revenues that most likely attracts more visitors..


Okay, you convinced me that it should (inter-subjectively) not count as a real false positive, as I first thought.

Nevertheless, when I try to analyze what is going on here, I would rather use the word "context" instead of "subjectivity", since I think (or at least hope) that my surprise at finding this brand at #2 in my Google results for "philosophy" is shared by quite a lot of people who lack the context to give it meaning, because this brand is unknown to them. I have the excuse that it is a North American brand irrelevant in my German context. Interestingly, when I search for "philosophy" on amazon.com (without refining the search), I get almost exclusively beauty products and related items as a result, but when I search for "philosophy" on amazon.de it is only books. Google nevertheless has the beauty brand as #2 in Germany. Can we agree that Amazon is better at considering the context of the search for "philosophy" than Google?

As an aside: Your "amazon" example reminds me when I was searching for "Davidson" expecting to find information about Donald Davidson, but received a lot of results about Harley-Davidson. (But since I was aware of the importance of this brand, it was understandable to me.)


We can agree on that, yes =)

I was thinking about this and when you look at the top keyword searches on Google, it's dominated by people searching brands each year, so I think Google is just naturally optimised for this. I think any Search Engine designed for the masses would probably have to behave like this too. https://www.siegemedia.com/seo/most-popular-keywords

I agree, I think the early web was used more for general information rather than specific brand information (and was more useful for people like myself). I'm not sure what is needed to get more results such as university papers or personal web-sites as I think that people use the internet differently now and that the link structure reflects that.

It's interesting that Google isn't used to search for people anymore (I couldn't see any people in the recent top 100 keyword search data).


Some observations:

Most of the "brands" in the top 100, especially at the beginning, are rather Internet services. These search terms seem to have been entered not with the intention to "search" in the sense to find some new information, but as a substitute for a bookmark to the respectice service. Who searches for #1 "youtube" does not want information about youtube, but wants to use the youtube Web-site as a portal to find videos there.

I would also guess that most of these searches weren't initiated through the Google website, but directly from the browser's address/search bar or a smartphone app. They exhibit a specific usage pattern, but do not show what the people who entered them were really searching for, if they were searching at all. What are those people who search for "youtube" doing next? Either searching again on YouTube, or logging into their YouTube account and browsing their YouTube bookmarks.

The early Internet did not have so many different services people used on a daily basis, and those that existed were more diverse (think of the many different online email providers in those days), so the search terms spread out more. Also, browsers had no direct integration with a search engine. The incentive was higher to use bookmarks for your favourite service, since otherwise you had to use a bookmark to a search engine anyway.

Perhaps it would be more appropriate to compare the use of the early Google not with the current Google, but with the current Google Scholar?


You inspired me to try an even less specific search: thing

Subjectively, I felt the Gigablast results were a relative delight.


Not a bad idea. At the risk of getting sidetracked: "philosophy" was not such a bad term either. Start with an arbitrary Wikipedia article and click on the first keyword of the summary after the linguistic annotations (or other annotations in brackets), and repeat the process until you reach a loop. You will almost always end up with "philosophy" -> "metaphysics" -> "philosophy" -> ... This works for "Berlin", "history" and "Caesar" as well as for "thing". For the latter very quickly: "thing" -> "object" -> "philosophy".
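For the curious, the game is easy to automate. A rough Python sketch that follows first links on live Wikipedia pages; it ignores the "skip links inside parentheses" rule, so it will occasionally take a wrong turn:

  import requests
  from bs4 import BeautifulSoup

  def first_article_link(title):
      # Return the target of the first ordinary wiki link in the article body.
      html = requests.get(f"https://en.wikipedia.org/wiki/{title}", timeout=10).text
      body = BeautifulSoup(html, "html.parser").find(id="mw-content-text")
      for p in body.find_all("p"):
          for a in p.find_all("a", href=True):
              href = a["href"]
              if href.startswith("/wiki/") and ":" not in href:  # skip File:, Help:, etc.
                  return href.removeprefix("/wiki/")
      return None

  def walk(title, max_hops=30):
      # Follow first links until we revisit a page or give up.
      seen, path = set(), [title]
      while title and title not in seen and len(path) <= max_hops:
          seen.add(title)
          title = first_article_link(title)
          if title:
              path.append(title)
      return path

  print(" -> ".join(walk("Thing")))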


That's tripped out. Where did you hear about that?


I can't remember. Probably on Hacker News.


I'll admit I have not been working on the quality of single-term queries as much as I should have lately. However, especially for such simple queries, having a database of link text (inbound hyperlinks and the associated hypertext) is very, very important. And you don't get the necessary corpus of link text if you have a small index. So in this particular case the index size is, indeed, quite likely a factor.

And thank you for the elaborate breakdown. It is quite useful and very informative, and was nice of you to present.

And I'm not saying that index size is the only obstacle here. I just feel it's the biggest single issue holding Gigablast's quality back. Certainly, there are other quality issues in the algorithm and you might have touched on some there.


Let me add just one thought on the single-term searches: I do not think that a good search result for such terms as "philosophy" should focus on the primary meaning of the term alone. As someone else pointed out, the beauty brand can be quite important for some people. If we look at a search engine as a tool that needn't present me with near-perfect results from the outset, but as something I can have a dialogue with to find something, then it is best that results for single terms present me with a variety of different special meanings (and probably some useful suggestions for how to refine my search). Perhaps you can scrape the Wikipedia disambiguation pages and use them somehow to refine your search results.


Let's compare with google:

- Berlin:

Wiki

Berlin travel site (visit Berlin)

website for Berlin

Youtube videos

Britannica for Berlin

Bunch of US town sites named Berlin

- Philosophy:

Same skincare website is first result

Wiki is second

Britannica is third

Stanford

News stories

Other dictionaries and encyclopedias

- History:

history.com is first result

Then is the "my activity" google site, maybe this is actually relevant

Youtube, lots of history channel stuff

Twitter history tag

Wikipedia for "History"

How to delete your Chrome browser history

Dictionary definitions

- Caesar:

Wiki for Julius Caesar

Britannica

BBC for JC

Google maps telling me how to get to Little Caesar's Pizza

Dictionary

Apparently some uni has a system called CAESAR

biography.com

Caesar salad recipe

history.com

images for Caesar


OK, I'll bite. How would _you_ rank the results for each of those queries?


With a slightly fresher coat of paint this could be very popular. For example, no grey background.


Great job, I didn't know about Gigablast and it looks very interesting. Can I give you a small piece of feedback? I just tried searching for myself on Gigablast, and the first results are profile pages which haven't been updated since like 2005. Meanwhile, my own personal page appears at the very bottom of the results.

So my suggestion would be to lower the weight of the ranking of the domain, and promote sites which have a more recent update date.

Send me an email (contact in profile) if you want to follow up on this feedback!


Regarding the gatekeeper problem: it's a wild guess, but if there were a way to involve users by organizing distributed scraping just for the sake of building a decent index, I'm sure many of them would help.


Yes, large proxy networks are potential solutions. But they cost money, and you are playing a cat-and-mouse game with Turing tests, and some sites require a login. Furthermore, people have tried to use these to spider LinkedIn (sometimes creating fake accounts to log in), only to be sued by Microsoft, who swings the CFAA at them. So you start off with an intellectual desire to make a nice search engine and end up getting sidetracked into this pit of muck and having Microsoft try to put you in jail. And, no, I'm not the one Microsoft was suing.


Not sure if you're looking for feedback, but the News search could use some work: I searched for "Ethiopia" and almost all of the articles were unrelated to Ethiopia except for the existence of some link somewhere on the page.

Your general web search seems pretty good, although I've just given it a casual glance. I think your News search could be improved by just filtering the general search results for News-related content, since the "Ethiopia" content I get there is certainly Ethiopia-related.

In any case, an interesting product, I'll try to keep an eye on it.


What heuristics or AI are being used to block your spider? If your spider appears human or organic, it will not be blocked, right?

Is this an issue of rate limiting, or request cadence? Could you add randomness to the intervals at which you request pages?

Is it more complicated? Do they use other signals to ascertain whether you are a script or not, like checking data from the browser (similar signals to the kind of things browser fingerprinting uses, e.g. screen resolution, user agent, cache availability, etc.)? Would it be possible for the browser to spoof this information?

I imagine rate limiting the IP address is the major issue... but could you not bounce the request through a proxy network? I've tried this with the TOR network before when writing web scrapers and had mixed success... seems like Google knows when a request is being made through Tor.

Perhaps you could use the users of your search engine as a proxy network through which to bounce the requests for the scraping/indexing... This way the requests would look like they were coming from any of your users instead of one spider's IP address... I'm not sure how Cloudflare or any other reverse proxy could determine whether those requests were organic or not...

I'd be OK with contributing to a distributed search service so long as my CPU was not making requests for illegal content, and there were constraints on the resource usage of my machine.

Sorry if this came off as all over the place, I do not know too much about the offense vs defense of scraping. These are just some thoughts...
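On the "add randomness to the intervals" point, a polite fetch loop usually looks something like this rough sketch (per-domain jittered delays; the delay range and user agent are made-up placeholders, and a real crawler also needs robots.txt handling, retries, etc.):

  import random
  import time
  from urllib.parse import urlparse

  import requests

  last_hit = {}  # domain -> timestamp of the previous request

  def polite_get(url, min_delay=5.0, max_delay=15.0):
      domain = urlparse(url).netloc
      wait = random.uniform(min_delay, max_delay)  # jittered per-request delay
      elapsed = time.monotonic() - last_hit.get(domain, 0.0)
      if elapsed < wait:
          time.sleep(wait - elapsed)
      last_hit[domain] = time.monotonic()
      return requests.get(url, headers={"User-Agent": "example-research-bot/0.1"}, timeout=30)

  print(polite_get("https://example.com/").status_code)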


> I've tried this with the TOR network before when writing web scrapers and had mixed success... seems like Google knows when a request is being made through Tor.

That's because all the Tor entry/exit nodes' and relays' IP addresses are public [1].

[1] https://metrics.torproject.org/rs.html#toprelays
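Which means a site (or a curious crawler operator) can simply check incoming IPs against the published exit list. A rough sketch; I'm assuming the bulk exit list URL below is still the current one:

  import requests

  def tor_exit_ips():
      # Download the published bulk exit list and return it as a set of IPs.
      text = requests.get("https://check.torproject.org/torbulkexitlist", timeout=30).text
      return {line.strip() for line in text.splitlines() if line and not line.startswith("#")}

  exits = tor_exit_ips()
  print("203.0.113.7" in exits)  # some client IP you want to check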


> It's much more expensive now to build a large index (50B+ pages)

Do you have a cost estimate? Also, could you be more selective in indexing, e.g. by having users request sites to be crawled?


Requiring users to know what sites they want in advance somewhat defeats the purpose of a search engine, no?


Not at all. You only have to fail the first request. It is an approach I took with my own attempt at a search engine way back! In fact, I know personally that there is at least one patent out there that suggests asking users who make a first-time request to provide the appropriate response, as an efficient way to teach the system for future users.

Obviously failing first requests isn't ideal but for popular requests it quickly becomes insignificant. Wikipedia might (if they don't already) want to make a similar suggestion for users to contribute when finding a low content/missing page.


> Obviously failing first requests isn't ideal but for popular requests it quickly becomes insignificant.

The first request can also be handled asynchronously, displaying a message to the user that it is 'processing...'.
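A rough sketch of that idea: if the query has no coverage yet, enqueue a crawl job and return a 'processing' placeholder instead of failing outright (the names search_index and crawl_queue are hypothetical):

  import asyncio

  search_index = {}  # query -> cached results (hypothetical store)

  async def handle_query(query, crawl_queue):
      if query in search_index:
          return search_index[query]
      await crawl_queue.put(query)          # kick off background crawling
      return "processing... check back shortly"

  async def crawler_worker(crawl_queue):
      while True:
          query = await crawl_queue.get()
          await asyncio.sleep(1)            # stand-in for real crawling/indexing
          search_index[query] = [f"result for {query!r}"]
          crawl_queue.task_done()

  async def main():
      crawl_queue = asyncio.Queue()
      asyncio.create_task(crawler_worker(crawl_queue))
      print(await handle_query("obscure topic", crawl_queue))  # first hit: placeholder
      await asyncio.sleep(1.5)
      print(await handle_query("obscure topic", crawl_queue))  # later: cached results

  asyncio.run(main())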


More often than not I have an idea which site a result might be on when I issue a query:

If I search for a news event it's a news site.

If I search an error message, I know the result is going to likely be stackoverflow, github issues or the forum of the library.

etc.

I don't think this strategy will get you all the way there, but it could be combined with other ways of curating sites to crawl.


Since sites are so desperate to be indexed, doesn't it seem better to put the onus on them to announce themselves? It would be great if DNS registries published public keys... maybe they do in newer schemes?


That works once your search engine is more widely used, but not a lot of sites are going to register with a niche search engine. Many users, on the other hand, really want a search engine like this and would be willing to invest some time.


Certificate Transparency (CT) Logs are this.


Is there a way to get the results to be formatted for desktop?

It looks like the layout is hard-coded for a mobile browser, in portrait mode.


I just looked myself up in your search engine and I can confirm that it finds stuff old enough that Google wouldn't find it (e.g. an old patch I submitted on GNU Savannah).

I tried looking up a game I'm interested in, and the second results cluster from your search engine is a Reddit thread about Linux support for that game... I love this.

Great job!


The Internet is such a fabric of society that I think all nations should contribute to a one-truth index. Not owned by a corporate entity. Tell me I’m wrong and we can consider the alternative: startups of all types with a more even playing field.


What are your sources for hostnames to crawl?

I looked into it a long time ago and seem to remember there was a way to get access to registration records, but I imagine combining that with HTTP certificate transparency records would significantly increase your hostname list. Anything else?


This is great! I found something other engines do not pick up! Apparently I signed the Agile Manifesto in 2010: https://agilemanifesto.org/display/000000190.html


I just tried it and the UI is kinda old and not mobile friendly but the English results I got were satisfying. Not the case for French though. I'll try again in the future, diversity in this landscape is important.


Re: crawling being too hard

Have you contributed your crawl data to common crawl?


> 2) Hardware costs are too high.

Which is why the next big search engine should be distributed: https://yacy.net.


"distributed" doesn't make things more hardware efficient... It literally always make them less efficient. If e.g : mastodon had the same number of users as Twitter it would use 10x the ressources for the same traffic.


Sure, but it does spread the costs among users and makes them more manageable. One guy shouldering the cost of a search index is less viable than letting users shoulder the costs. Some charge customers as a solution to this, and that works, but then they need a minimum revenue to continue, or have to monetize with investors, which usually means changing direction and goals. The other option, letting people host portions of the index, spreads the cost out, and the product gets about as good (best-case scenario) as its utility to people.


No way to test it right away, demo peer 502-es.


You could search for other public-facing instances, e.g., http://sokrates.homeunix.net:6060.


I tried searching for an answer, but how do you get a site added to your directory? Who maintains it? Directories are a real PITA to maintain with any quality.


Regarding the Gatekeeper companies like Cloudflare, it sounds like anti-competitive behavior that could potentially be targeted with anti-trust legislation, correct?


Cloudflare functions kinda like a private security company. They don't go around blocking sites willy-nilly, site owners have to specifically choose to use their service (and maybe pay for it), configuring the bot blocking rules themselves.

That's not really Cloudflare's fault. Someone has to do it, whether it's them or a competitor or sys admins manually making firewall rules. Cloudflare just happens to be good enough and darned affordable, so many choose to use them.

Hosting costs for small site owners would be much more expensive without Cloudflare shielding and caching.


I've had extensive dealings with Cloudflare. They have a complex whitelisting system that is difficult to get on, and they also have an 'AI' system that determines if you should be kicked off that whitelist for whatever reason.

Furthermore, they give Google preferred treatment in their UIs and backend algos because it is the incumbent and nobody cares about other smaller search engines. So there's a lot of detail to how they work in this domain.

It's 100% Cloudflare's fault, and it's up to them to give everyone a fair shot. They just don't care. Also, you are overlooking the fact that Google is a major investor (and so are Bing and Baidu). So really this exacerbates the issue. Should Google be allowed (either directly or indirectly) to block competing crawlers from downloading web pages?


It isn't up to them to give everyone a fair shot. That isn't what their customers actually want. Cloudflare aren't in the "fair shots for all search engines" business. They are in the "stop requests you don't want hitting your servers" business.


I'd argue that a level playing field and more competition in the search space is a good thing.


These are all great points.


No, I think it is partially Cloudflare's fault because they offer this service and make it easy to deploy. This shit has exploded with Cloudflare's popularity.

Nobody has to do it, but a lot of people will do it when they notice there's an easy way to do it. Cloudflare is very much an enabler of bad behavior here. Now a lot of sites just toggle that on without even thinking about collateral.


"targeted with anti-trust legislation"

Um, this is America. Every market is basically a trust, cartel, or monopoly.

And I don't know if you can hear that, but there is literally laughter in the halls of power. All the show hearings by congress on social media and tech companies only have to do with two things:

1) one political party thinking the other is getting an advantage by them

2) shaking them down for more lobbying and campaign donations

No one in the halls of power gives two shits about competition. Larger companies mean larger campaign donations, and more powerful people to hobnob with if/when you leave or lose your political office.

Of course I think that breaking up the cartels in every major sector would lead to massive improvements: more companies means more employment, more domestic employment, more people trying to innovate in management and product development, more product choice, lower prices, more competition, more redundancy/tolerance to supply chain disruption, less corruption in government and possibly better regulation.

Every large company brazenly abuses its market up to the point of one and only one limiter: the "bad PR" line. So I guess we have that.


Companies don't make campaign donations. The people "exposing" them are showing their employees making donations, and employees don't have the same interests as their employer.


it should be. there should be some sort of 'bots rights' to level the playing field. perhaps this is something the FTC can look into. but, as it is right now big tech continues to keep their iron grip on the web and i don't see that changing any time soon. big tech has all the money and controls access to all the data and supply chains to prevent anyone else from being a competitive threat.

look at linkedin (owned by microsoft, unspiderable by all but google/bing), github (now microsoft, using it to fuel its AI coding buddy, but try to spider it at capacity and your IP is banned), facebook (unspiderable) .. the list goes on and on ..

and as you can see, data is required to train advanced AI systems, too. So big tech has the advantage there as well. especially when they can swoop in and corrupt once non-profit companies like openai, and make them [partially] for-profit.

and to rant on (yes, this is what i do :)) it's very difficult to buy a computer now. have you tried to buy a raspberry pi or even a jetson nano lately? Who is getting preferred access to the chip factories? Does anyone know? Is big tech getting dibs on all the microchips now too?


At a theoretical level it looks like Cloudflare won't block search engine crawlers. The docs are very Google and Bing oriented, and also oriented towards supporting their customers, not a random new search engine crawler.

> Cloudflare allows search engine crawlers and bots. If you observe crawl issues or Cloudflare challenges presented to the search engine crawler or bot, contact Cloudflare support with the information you gather when troubleshooting the crawl errors via the methods outlined in this guide.

https://support.cloudflare.com/hc/en-us/articles/200169806


No, it is not.

Cloudflare is giving its customers what they want. They don't want all kinds of bots claiming to be search engines crawling their sites. Cloudflare isn't hurting Cloudflare's competitors by doing this. Cloudflare isn't hurting their customers by doing this. To repeat - most websites don't want lots and lots of crawlers. They want the 2 or 3 which matter and no more, because at some point it's difficult to tell what a crawler is doing (is it a search engine?). They aren't obliged to help search engines. Even if Cloudflare wasn't offering this, bigger customers would roll their own and do more or less the same thing.


i would assume it's mostly anti-scraping protection, which is mostly for privacy. you don't want to allow everyone to scrape your website and pull and use your info, for example from fb, ig, LinkedIn, github, .... you can build a really big profiling db on people that way. so websites need to know you are a legit search engine first


people can still be targeted if that information is public. anti scraping sounds like security by obscurity


If you have customers, does that mean the incremental gain from an improved index costs too much to store? Or are you talking about computational costs?


it's both storage and computational. they go hand in hand.


what kind of index is Gigablast using? traditional inverted index like Lucene or something more esoteric?

I know Google and Bing both use weird data-structure like BitFunnel

https://www.microsoft.com/en-us/research/publication/bitfunn...


100% custom.


> Hardware costs are too high

I want to say - you don’t know what you are talking about. But, it’ll be rude.

Hardware is much cheaper and powerful now compared to 2005.


You've said it, and it is rude; what's the point of that first sentence except to spite him? I'm sure he's well aware of the price-per-capability trend since 2005; you don't code a search engine without knowing that. It could be that servicing his free users and/or maintaining an ever-growing database/index is what's costly, in spite of relatively cheaper hardware.


the complexity of the search algorithm has also increased substantially since 2005. And, in 2005, a billion-page index was pretty big. Now it's closer to 100 billion.


There were ~60B pages on Facebook in 2015; I think your numbers are outdated. - Google search SRE


What if you allowed trusted contributors to "donate" their browsing to your index?

AltaVista and Yahoo did that with browser plugins in the 90s.


I really love how the results organize multiple matching pages from the same domain. This is really cool.


I wanted to add my site to Gigablast, but it said it would cost 25 cents. How is this a good thing?


curious how you implemented the index, memory based or disk based? Either way you are right, HW costs are extremely high and you would need a lot of high-RAM/high-core-count machines to serve such a large index to end users with low latency.


Make sure to file complaints to any competition market authority you have in your country.


Oh my god! This works so much better than every Internet search engine I have tried.


Storing information about the pages you can't index is also useful


I really like GigaBlast.

I wrote a "meta" search utility for myself that can query multiple search engines from the command line.^1 It mixes the results into a simplified SERP ("metaSERP"), optimised for a text-only browser, with indicators to show which search engine each result came from. The key feature is that it allows for what I might call "continuation searches". Each metaSERPs contains timestamps in its page source to indicate when searches were executed, as well as preformatted HTTP paths. The next search can thus pick up where the previous one left off. Thus I can, if desired, build a maximum-sized metaSERP for each query.

The reason I wrote this is because search engines (not GigaBlast) funded by ads are increasingly trying to keep users on page one, where the "top ads" are, and they want to keep the number of results small. That's one change from 2005 and earlier. With AltaVista I used to dig deep into SERPs and there was a feeling of comprehensiveness; leave no stone unturned. Google has gradually ruined the ability to perform this type of searching with their now secretive and obviously biased behind-the-scenes ranking procedures.

Why is there no way to re-order results according to objective criteria, e.g., alphabetical? The user must accept the search engines' ordering, which gives them the ability to "hide" results on pages the user will never view, or simply not return them at all. That design is more favorable to advertising and less favorable to intellectual curiosity.

Each metaSERP, OTOH, is a file and is saved in a search directory for future reference; I will often go back to previous queries. I can later add more results to a metaSERP if desired. I actually like that GigaBlast's results are different than other search engines'. The variety of results I get from different sources arguably improves the quality of the metaSERP. And, of course, metaSERPs can be sorted according to objective criteria.

This is, AFAIK, a different way of searching. The "meta-search engines" of yesteryear did not do "continuations", probably because it was not necessary. Nor was there an expectation that users would want to save meta-searches to local files. Users were not trying to minimise their usage of a website; they were not trying to "un-google".

Today's world of web search is different, IMO. There seems to be a belief that the operator of a search engine can guess what a user is searching for, that a user who sends a query is only searching for one specific thing, and that the website has an ad to match with that query. At least, those are the only searches that really matter for advertising purposes. Serendipitous discovery while perusing results is not contemplated in the design. By serendipitous discovery I do not mean sending a random query, e.g., adding an "I'm feeling lucky" button, which to me always seemed like a bad joke.

The only downside so far is I occasionally have to prune "one-off" searches that I do not want to save from the search directory. I am going to add an indicator at search time that a search is to be considered "ephemeral" and not meant to be saved. Periodically these ephemeral searches can then be pruned from the search directory automatically.

1. Of course this is not limited to web search engines. I also include various individual site search engines, e.g., Github.
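
To give a feel for the general shape (this is not my actual code, just a toy with stubbed-out fetching and made-up names): each engine's results get tagged with their source and appended to a per-query HTML file, and the offset lets a later run continue where the last one stopped.

  # Toy "metaSERP": merge tagged results from several engines into one
  # locally saved, append-only HTML file per query.
  import datetime
  import html
  import pathlib

  ENGINES = ["gigablast", "bing", "ddg"]   # names only; real fetching omitted

  def fetch_results(engine, query, offset=0):
      """Stub: a real version would request the engine's SERP at `offset`
      and parse out (title, url) pairs with engine-specific code."""
      return [(f"{engine} result {offset + i} for {query}", "https://example.com")
              for i in range(3)]

  def meta_search(query, offset=0, outdir="searches"):
      stamp = datetime.datetime.now().isoformat(timespec="seconds")
      rows = [f"<!-- executed {stamp}, offset {offset} -->"]
      for engine in ENGINES:
          for title, url in fetch_results(engine, query, offset):
              rows.append(f'<p>[{engine}] <a href="{url}">{html.escape(title)}</a></p>')
      path = pathlib.Path(outdir) / f"{query.replace(' ', '_')}.html"
      path.parent.mkdir(exist_ok=True)
      with path.open("a") as f:                 # append = continuation search
          f.write("\n".join(rows) + "\n")
      return path

  # meta_search("pdp-11 emulator")            # first page
  # meta_search("pdp-11 emulator", offset=3)  # continue where it left off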


Wow, do you happen to have published your utility so that other people can play with it?


The problem is that (1) I am a minimalist and dislike lots of "features" and (2) I prefer extremely simple HTML that targets the links browser. Most users are probably using graphical, Javascript- and CSS-enabled browsers, so while this may work great for me, it may be of little interest to others who have higher aesthetic expectations. Another problem is I prefer to write tiny shell scripts and small programs in C that can be used in such scripts. To be interesting to a wider audience, I would likely have to re-write this in some popular language I do not care for.

If I see people on HN complain about how few results they get from search engines, then that could provide some motivation to publish. I am just not sure this is a problem for others besides me.

Many results I get from search engines are garbage. By creating a metaSERP with a much higher number of results overall, from a variety of sources, I believe I get a higher number of quality ones.


Well something like that would be interesting to a particular demographic. I prefer minimal aesthetic cruft as well, and like terminal stuff like links.

If you ever do decide to publish, be sure to post it here!


How much cash do you need?


did you ever try to raise funds? why/not? not accusing, just curious.

did you ever think, let me just focus on Italy-relevant results? or job search only? or some slice of search.


maybe just add small webpages into your index, don't bother to execute JS and don't download any images.

The content quality will be higher and it's a lot cheaper.
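
Something like this, roughly (a sketch under my own assumptions; the size cap is arbitrary, and a real crawler would need fallbacks for servers that refuse HEAD or omit Content-Length):

  # Only fetch plain HTML under a size cap; never run scripts, never
  # request images.
  import urllib.request

  MAX_BYTES = 256 * 1024   # skip anything bigger than ~256 KB

  def fetch_if_small_html(url):
      req = urllib.request.Request(url, method="HEAD")
      with urllib.request.urlopen(req, timeout=10) as head:
          ctype = head.headers.get("Content-Type", "")
          clen = int(head.headers.get("Content-Length") or 0)
      if "text/html" not in ctype or clen > MAX_BYTES:
          return None                     # not worth indexing
      with urllib.request.urlopen(url, timeout=10) as resp:
          return resp.read(MAX_BYTES)     # raw HTML only; no JS, no images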


Out of curiosity, why would not executing JavaScript or not downloading images equal higher content quality?


Why do you have a user account with a login?


Do you have some sort of PageRank?


how recent are your results? 1-2h? 1 day?


it's continually spidering. just not at a high rate. actually, back in the day i had real time updates while google was doing the 'google dance'. that caused quite a stir in the web dev community because people could see their pages in the index being updated in real time whereas google took up to 30 days to do it.


>Gigablast has teamed up with Imperial Family Companies

Associating with that crank (responsible for recent freenode drama) is very off-putting.


Oh no, you see he isn't responsible, it's everyone else! /s


I don't get it, what's the fuss here?


The guy who took over Freenode styles himself as the crown prince of korea; IFC is his company.


The consistent theme every time this comes up is that dealing with the sheer weight of the internet is almost impossible today. SEO spam is hard to fight and the index gets too heavy. However, I wonder if this is a sign that we're looking at the problem wrong.

What if instead of even trying to index the entire web, we moved one step back towards the curated directories of the early web? Give users a search engine and indexer that they control and host. Allow them to "follow" domains (or any partial URLs, like subreddits) that they trust.

Make it so that you can configure how many hops it is allowed to take from those trusted sources, similar to LinkedIn's levels of connections. If I'm hosting on my laptop, I might set it at 1 step removed, but if I've got an S3 bucket for my index I might go as far as 3 or 4 steps removed.

There are further optimizations that you could do, such as having your instance not index Wikipedia or Stack Overflow or whatever (instead using the built-in search and aggregating results).

I'm sure there are technical challenges I'm not thinking of, and this would absolutely be a tool that would best serve power users and programmers rather than average internet users. Such an engine wouldn't ever replace Google, but I'd think it would go a long way to making a better search engine for a single user's (or a certain subset of users') everyday web experience.
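
Very roughly, the crawler I'm picturing looks something like this (a sketch only; the seed list, hop limit, and helper names are made up, and fetching/link extraction are stubbed out):

  # Breadth-first crawl outward from "followed" sites, capped at MAX_HOPS
  # hops from any trusted seed.
  from collections import deque

  FOLLOWED = {"lwn.net", "news.ycombinator.com"}
  MAX_HOPS = 2   # laptop-sized index: stay close to the seeds

  def extract_links(url):
      """Stub: a real crawler would fetch `url` and return its outbound links."""
      return []

  def crawl():
      frontier = deque(("https://" + seed + "/", 0) for seed in FOLLOWED)
      seen, to_index = set(), []
      while frontier:
          url, hops = frontier.popleft()
          if url in seen or hops > MAX_HOPS:
              continue
          seen.add(url)
          to_index.append(url)                  # hand off to the indexer
          for link in extract_links(url):
              frontier.append((link, hops + 1))
      return to_index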


I agree, I think we are looking at the problem wrong, and this is a very insightful comparison with the LinkedIn levels-of-connections idea. I am working on something with this.

One thing to point out is that when we search through information, we are searching through an information structure, aka a graph of knowledge. Whatever idea or search term we are thinking of is connected to a bunch of other ideas. All those connected ideas represent the search space, or the knowledge graph, we are trying to parse.

One way people have tried to approach this in the past is to make a predefined knowledge graph, or ontology, around a domain: they set up the structure of how the information should be and then fill in the data. The goal is to dynamically create an ontology. Idk if anyone has really figured this out, but Palantir with Foundry does something related: they sorta dynamically create an ontology on top of a company's data. This lets people find relationships between data and more easily search through their data. Check this out to learn more https://sudonull.com/post/89367-Dynamic-ontology-As-Palantir...


This might work well in some situations (e.g. research, development), however it would also increase the effect of echo chambers I think.


Possibly, but I'm not convinced.

Google's not exactly working against the echo chamber problem, and I think that's because to do so would be to work against its own reason for existing. There are two goals here that are fundamentally at odds with each other:

1) Finding what you're looking for.

2) Finding a new perspective on something.

A search engine's job is to address the first challenge: finding something that the user is looking for. The search engine might end up serving both needs if they're looking for a new perspective on something, but if these two goals ever come into conflict with each other the search engine does (and I would argue it should) choose the first goal. Failing to do so will just lead to people ignoring the results.


Part of the thing with echo chambers is that the search terms themselves can be indicative of a particular bubble. For example, there's a difference between the people that refer to the Bureau of Alcohol, Tobacco, and Firearms by the official initialism, "ATF", and those that use "BATF". There's a strong anti-gun-control bent to the `"BATF" guns` query, compared to the `"ATF" guns` query.

If you're indexing forums or social media, the same site is going to give back the bubbled responses, possibly without the person even being aware they're in a bubble.

https://www.google.com/search?q=%22BATF%22+guns&client=safar...

https://www.google.com/search?q=%22ATF%22+guns&client=safari...


Kind of like when searching for "jew" on Google led to antisemitic websites, that's because jews usually prefer the term "jewish".

Interestingly, back then, Google was big on neutrality and refused to do anything, stating that it reflected the way people used the word. It was finally addressed using "Google bombing" techniques. Something that Google didn't care much about back then because of its low impact.


echo chambers are what most people want :)


The retro idea of curation seems popular here, but everybody forgets why it lost out in the first place: it just doesn't scale. Not to mention demand - people usually want tools which lower mental effort and are intuitive, as opposed to ones which are precise but obtuse to use. Most would not find a hardware mouse that consisted of two keypads for X and Y coordinates plus a left-click and right-click button very useful.

Similarly, everyone maintaining their own index is cumbersome overkill in redundancy, processing power, and human effort, in return for a stunted network graph which is worse on all the metrics people usually actually care about. In terms of catching on, even "antipattern search engines" that attempt to create an ideological echo chamber would probably catch on better.

Short of search engine experiments/startup attempts, the only other useful application I can see is "rude web-spidering", which deliberately disrespects all requests not to index pages left publicly accessible; search engines generally try not to be good tools for cracker wardriving, for PR and liability reasons. It would be a good whitehat or greyhat tool, as doors secured by politeness alone are not secure.


I like the idea of a subset of the web, and for a niche purpose. Not sure about user-hosting.

Capital is the huge barrier to entry today:

Larry Page's genius was to extend Google's tech, consumer-habit and PR barriers-to-entry into a capital-based advantage: massive geo server farms, giving faster responses. Consumers have demonstrated a huge preference for faster responses.


I’ve often thought the Alexa.com top-n sites would be a good starting point.


I wonder if we could use some kind of federation (ActivityPub?) to build an aggregate of the search indexes of a curated community. Something like a giant federated whitelist of domains to index.


That's basically what I'm doing with my searches: "site:reddit.com". I wonder if anyone at Google is aware of this trend and taking notes.


I estimate that about half of my searches have either site:reddit.com or site:news.ycombinator.com at the end. In fact, I have an autocomplete snippet on my Mac so I don't have to type all that.


FYI this is exactly what the hashbangs in DDG do!


Reddit is missing a huge opportunity by not improving their crappy search functionality.


What if we allowed users to upvote and downvote search results? Too many downvotes and you get dropped from the index.


Companies will simply hire people or purchase bots to downvote their competitors and upvote themselves, and then an entire economy will develop around gaming search engine algorithms, so that eventually search results will be completely useless.

Basically, SEO. SEO is the real problem, not search engine algorithms. Those algorithms are a result of the arms race between Google and black-hat SEO BS. Remove SEO and search engines work just fine.


what you are suggesting would make the problem of echo chamber (bubble) worse than it is today!


Awkwardly, complaints about echo chambers as a problem tend not to refer to feedback dynamics (crudely but unambiguously referred to as circle-jerking) so much as to "People disagree with me, the nerve of them!". It is not viable to have parties A through Z sharing the same world and all having absolute control over all the others. We see this same complaint every time moderation comes up, let alone the fundamentals of democracy.


Bubbles are great if you are on the outside looking in at how a specific group thinks. Bubbles are horrible if you are on the inside trying to explore your thoughts.


It's flawed from the get go if reddit is the basis.


As much as I like to hate on reddit (I'm a permanently suspended user), not every sub there is trash. There are some great subs there on very specific niche topics.


Badge of honour I'd say. What was your transgression?


Someone asked about the Hunter Biden files. I responded with g n e w s . c o m . It took a few weeks, but they finally found it and suspended me for it. Others they suspended for mentioning the news organization that mentioned gnews.


I'm a permanently suspended member too (permanent for technical reasons), and I have never posted on there.


Natural Language Processing is a pox on modern search engines. I suspect that Google et al. wanted to transform their product into an answer engine that powers voice assistants like Siri, and just assumed everyone would naturally like the new way better. I can't stand how Google is always trying to guess what I want, rather than simply returning non-personalized results based solely on exactly what I typed in the textbox.

While that may be good for most people, there is still a lot of power and utility in simple keyword-driven searches. Sadly, it seems like every major search engine has to follow Google's lead.


I think some NLP is strictly beneficial for a search engine. You may think "grep for the web" sounds like a good idea, but let me tell you, having tried this, manually going through every permutation of plural forms of words and manually iterating the order of words to find a result is a chore and a half.

Like, instead of trying

  PDP11 emulator
  PDP-11 emulator
  "PDP 11" emulator
  PDP11 emulators
  PDP-11 emulators
  "PDP 11" emulators
  PDP11 emulation
  PDP-11 emulation
  "PDP 11" emulation
Basic NLP can do that a lot faster without introducing a lot of problems.

I do think Google currently goes way overboard with the NLP. It often feels like the query parser is an adversary you need to outsmart to get to the good results, rather than something that's actually helpful. That's not a great vibe. However, I think the big problem isn't what they are doing, but how little control you have over the process.
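
To make the "basic NLP" concrete, a toy normaliser along these lines collapses those permutations at index time and query time. The suffix rules below are crude stand-ins for a real stemmer (Porter, Snowball, etc.) and are purely illustrative:

  import re

  SUFFIXES = ("ations", "ation", "ators", "ator", "ates", "ate", "ers", "er", "s")

  def stem(token):
      # strip one common suffix, keeping at least 4 characters of stem
      for suf in SUFFIXES:
          if token.endswith(suf) and len(token) - len(suf) >= 4:
              return token[: -len(suf)]
      return token

  def normalise(text):
      # split on letter/digit boundaries and punctuation: "PDP-11" -> ["pdp", "11"]
      tokens = re.findall(r"[a-z]+|\d+", text.lower())
      return [stem(t) for t in tokens]

  # "PDP-11 emulators", "pdp11 emulation" and "PDP 11 emulator" all map to
  # ['pdp', '11', 'emul'], so one index term covers every permutation above.
  assert normalise("PDP-11 emulators") == normalise("pdp11 emulation")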


I get that for general-purpose searches this is a good idea, but it would be nice if there was an easy way to disable this when you know you don't want it - for example, for most programming searches, if I type SomeAPINameHere the most relevant results will always be those that include my search term verbatim. I don't need Google to helpfully suggest "Did you mean Some API Name Here?", which will virtually always return lower-quality search results.

Early Google was a breath of fresh air compared to the stemming that its competitors at the time did, but nowadays even putting search terms in quotes doesn't seem to return the same quality of results for these types of queries that Google used to have.


I feel your pain. Two workarounds when Google gets it wrong are to put the term in quotation marks, or to enable Verbatim mode in the toolbelt. (I know various people have come up with ways to add "Google Verbatim" as a search engine option in their browser, or use a browser extension to make Verbatim enabled by default.)

Disclaimer: I work on Google search.


Both of these options are disappointing, in my experience. Verbatim mode seems weirdly broken sometimes (maybe it's overly strict), and quoting things is rarely enough to convince Google that you really want to search for exactly that thing and not some totally different thing that it considers to be a synonym.

One porridge is too hot and the other is too cold. I know Google could find a happy compromise here if it wanted to. In fact, I bet there's some internal-only hacked-together version that works this way and actually gives an acceptable user experience for the kind of people who have shown up to this thread to show their dissatisfaction.


Try this, go to Google and type in "eggzackly this".

Two results not containing "eggz" at all. Two results containing "eggzackly<punctuation>this". Two results containing "eggzackly" but missing "this".

Google Search is broken. It no longer does what it's directed to do, it just takes a guess. I suspect part of this is because someone decided that "no results found" was the worst possible result a search engine could give.


Googling that with the brackets I get results containing "eggzackly this" ranked 3, 4, 6 (your comment) and 7 whereas the others contain just eggzackly (or with the 'this' preceded by punctuation as you mention).

Therefore I don't see how your last sentence is the explanation (there are results). I've also happened to land on "no results found" sometimes with overly precise quoted queries (for coding errors mostly, IIRC). But it is annoying that it doesn't seem strictly enforced even when you want it to be.


Google does go way overboard with "NLP". Starting at least 5 years ago there was a trend toward "similar" matching and search result quality nose-dived.

You can search for, say, "cycling (insert product category here)" and get motorcycle related results. Why? Because to google "cycling" = "biking" and "motorcycles" are "bikes", bob's your uncle, now you're getting hits for motorcycle products.

Every time I try to do a very specific search I can see from the search results how google tries to "help", especially if the topic is esoteric. The pages actually about the esoteric thing I'm searching for get drowned in a sea of SEO'd bullshit about a word/topic that is 1-2 degrees of separation away in a thesaurus. I'm sure someone at google is very, very proud of this because it increases their measure for search user satisfaction by X percent.

It does this thesaurus crap even with words in quotes, which is especially infuriating.


Yeah. It's one of those things where it's invisible when it works and enraging when it doesn't. That's generally not a desirable failure mode. At the very least, it should require extremely low failure rates to be justified.


"Basic NLP can do that a lot faster without introducing a lot of problems."

This is called "stemming" and is not sensibly approached with machine learning.


Of course, but stemming is a fairly basic technique in NLP, as is POS-tagging. NLP is not machine learning.


Modern NLP basically is machine learning


You can still do NLP without machine learning though, and much of the computational linguistics a search engine needs for keyword extraction and query parsing doesn't require particularly fancy algorithms. What it needs is fast algorithms, and that's not something you're gonna get with ML.


Stemming is not meaningfully a natural language processing technique, any more than arithmetic is a technique of linear equations.


At the very least, https://en.wikipedia.org/wiki/Natural_language_processing seems to disagree.

(So do I: NLP does not have to be machine learning/AI based)


Is it not the processing of natural language?


Would you call addition a system of linear equations?

No, you don't use the college senior label for the highschool freshman topic. You use the smallest label that fits.

It's string processing.

NLP is actually understanding the language. Stemming is simple string matching.

Playing the technicality game to stretch fields to encompass everything you think even marginally related isn't being thorough or inclusive; it's being bloated, and losing track of the meaning of the term.

Splitting on spaces also isn't NLP.


Stemming is a task specific to a natural language. You can't run an English stemmer on French and get good results, for example.

All NLP is, strictly speaking, more or less elaborate string matching.

> Splitting on spaces also isn't NLP.

String splitting can be, but it's a bit borderline. I'll argue you're in NLP territory if it doesn't split "That FBI guy i.e. J. Edgar Hoover." into four "sentences".
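
A toy illustration of that boundary (everything here is made up for illustration; the abbreviation list is deliberately tiny, and the exact number of pieces depends on how naive the naive splitter is):

  import re

  ABBREVS = {"i.e", "e.g", "etc", "mr", "dr", "j"}   # illustrative only

  text = "That FBI guy i.e. J. Edgar Hoover."

  # naive: split at any whitespace that follows a period
  naive = re.split(r"(?<=\.)\s+", text)
  print(naive)   # ['That FBI guy i.e.', 'J.', 'Edgar Hoover.'] -- 3 "sentences"

  def smarter_split(s):
      sentences, buf = [], ""
      for part in re.split(r"(?<=\.)\s+", s):
          buf = (buf + " " + part).strip()
          last_word = buf.rstrip(".").rsplit(maxsplit=1)[-1].lower()
          if last_word not in ABBREVS:   # looks like a real sentence boundary
              sentences.append(buf)
              buf = ""
      if buf:
          sentences.append(buf)
      return sentences

  print(smarter_split(text))   # ['That FBI guy i.e. J. Edgar Hoover.'] -- 1 sentence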


> NLP is actually understanding the language.

That's actually not an accepted terminology. There's, indeed, this:

  https://en.wikipedia.org/wiki/Natural-language_understanding
Not sure why you are so adamant that yours is the "true meaning", when NLP existed long before machine learning and AI were used for it. And even if not, every term can be defined differently, so it should be normal to have different institutions/people define NLP differently.


Semantic search requires NLP. So does the Q&A format the OP is complaining about. People conflate all things NLP to the latter, and forget about the former.


Maybe I'm not using the right qualifiers around the term NLP. The kind of NLP I was referring to is something like "Hey google, what is natural language processing?" and orienting the search around people asking questions in standard(ish) English like they would to another person.


That's known as Open Domain Question Answering[1] and is only a subset of NLP.

[1] https://www.pinecone.io/learn/question-answering/


NLP is very heavily integrated into search, so I don't think it's really possible to decouple them. But I agree the whole BonziBuddy thing they've got going now is annoying and it's especially unfortunate how it's replaced the search functionality. I'd have a lot more patience with it if I could choose this functionality when I wanted to ask a question.


I doubt they assumed it was better. I expect they did a ton of user testing and found that it was better for most people. And I'm sure it is. HN users are very much a niche audience these days.


Right. Bing switched to this method as well, as did Facebook, Twitter, Amazon, and pretty much every other company that has the ML resources to do this. They obviously had a good reason to do so, beyond assumptions.


What’s a pox?


Saying X is a pox on Y means saying X is bad for Y.

It originates from the disease 'the pox'.


a disease or plague

