A search engine that favors text-heavy sites and punishes modern web design (marginalia.nu)
3441 points by Funes- 36 days ago | 717 comments



Wow, that's awesome. Great work!

For a simple test, I searched "fall of the roman empire". In your search engine, I got wikipedia, followed by academic talks, chapters of books, and long-form blogs. All extremely useful resources.

When I search on google, I get wikipedia, followed by a listicle "8 Reasons Why Rome Fell", then the imdb page for a movie by the same name, and then two Amazon book links, which are totally useless.


Good comparison. Reminds me of an analogy I like to make about today's web: it feels like browsing through a magazine store — full of top 10s, shallow wow-factoids, and baity material. I genuinely believe terrible results like this are making society dumber.


The context matters. I'd happily read "Top 10" lists on a website if the site itself was dedicated to that one thing. "Top 10 Prog Rock albums", while a lazy, SEO-bait title, would at least be credible if it were on a music-oriented website.

But no, these stories all come from cookie-cutter "new media" blog sites, written by an anonymous content writer who's repackaged Wikipedia/Discogs info into Buzzfeed-style copy writing designed to get people to "share to Twitter/FB". No passion, no expertise. Just eyeballs at any cost.


This got me thinking that maybe one of the other big reasons for this is that the algorithms prioritize newer pages over older pages. This produces the problem where instead of covering a topic and refining it over time, the incentive is to repackage it over and over again.

It reminds me of an annoyance I have with the Kindle store. If I wanted to find a book on, let's say, Psychology, there is no option to find the all-time respected books of the past century. Amazon's algorithms constantly push to recommend the latest hot book of the year. But I don't want that. A year is not enough time for society to determine if the material withstands time. I want something that has stood the test of time and is recommended by reputable institutions.


This is just a guess, but I believe that they use machine learning and rank results by clicks. I took some Coursera courses and Andrew Ng sort of suggested that as their strategy.

The problem is that clickbait and low-effort articles can be good enough to get the click, but low-effort enough to drag society into the gutter. As time passes, the system is gamed more and more, optimizing for the least effort for the most clicks.


It sounds like the problem is that the search engine has no way to measure satisfaction after the click.


But they have, or could have. At least Google (and to a smaller extent Microsoft) has exactly that signal if you are using Chrome or Bing. Whether you stay on the site and scroll (taking time, reading, not skimming) could all be used as signals to evaluate whether the search result met your needs.


I've heard Google would guess with bounce rate. Put another way: if the user clicks on linked website A, then after a few moments keeps trying other links/related searches, it would mean the result was not valuable.


They tried to gain insight into this with the "bounce rate".


> is that the algorithms prioritize newer pages over older pages.

They do? That would explain a lot - but ironically, I can't find a good source on this. Do you have one at hand?


It is pretty obvious if you search for any old topic that is also covered incessantly by the news. "royal family" is a good example. There's no way those news stories published an hour ago are listed first due to a high PageRank score (which necessarily depends on time to accumulate inbound links).


It depends on the content. The flip side is looking up a programming-related question and getting results from 2012.

I think they take different things into account based on the thing being searched.


Even your example would depend upon the context. There are many cases where a programming question in 2021 is identical to one from 2012, along with the answer. In those instances, would you rather have a shallow answer from 2021 or an in-depth answer from 2012? This is not meant to imply that older answers offer greater depth, yet a heavy bias towards recent material can produce that outcome in some circumstances.


If you're using tools/languages that change rapidly (like Kotlin, in my case), syntax from a few years ago will often be outdated.


Yes, yet there are programming questions that go beyond "how do I do X in language Y" or "how do I do X with library Y". The language and library specific questions are the ones where I would be less inclined to want additional depth anyhow, well, provided they aren't dependent upon some language or library specific implementation detail.


There are of course a variety of factors, including the popularity of the site the page is published on. The signals related to the site are often as important as the content on the page itself. Even different parts of the same site can lend varying weight to something published in that section.

Engagement, as measured in clicks and time spent on page, plays a big part.

But you're right, to a degree, as frequently updated pages can rank higher in many areas. A newly published page has been recently updated.

A lot depends on the (algorithmically perceived) topic too. Where news is concerned, you're completely right, algos are always going to favor newer content unless your search terms specify otherwise.

PageRank, in its original form, is long dead. Inbound link related signals are much more complex and contextual now, and other types of signals get more weight.


Your Google search results show the date on articles, do they not? If people are more likely to click on "Celebrity Net Worth (2021)" than "Celebrity Net Worth (2012)", then the algo will update to favour those results, because people are clicking on them.

The only definitive source on this would be the gatekeeper itself. But Google never says anything explicitly, because they don't want people gaming search rankings. Even though it happens anyway.


The new evergreen is refreshed sludge for bottom dollar. College kids stealing Reddit comments or moving around paragraphs from old articles. Or linking to linked blogs that link elsewhere.

It's all stamped with Google Ads, of course, and then Google ranks these pages high enough to rake in eyeballs and ad dollars.

Also there's the fact that each year, the average webpage picks up two more video elements / ad players, one or two more ad overlays, a cookie banner, and half a dozen banner/interstitials. It's 3-5% content spread thinly over an ad engine.

The Google web is about squeezing ads down your throat.


Really makes you wonder: you play whack-a-mole and tackle the symptoms with initiatives like this search engine, but the root of this problem and many, many others is the same: advertising. Why don't we try to tackle that?


Perhaps a subscription-based search engine would avoid these incentives.


Let’s go a few levels deeper and question our consumption culture.


Exactly.

The only reason people make content they aren't passionate about is advertising.


> This got me thinking that maybe one of the other big reasons for this is that the algorithms prioritize newer pages over older pages.

Actually that's not always the case. We publish a lot of blog content and it's really hard to publish new content that replaces old articles. We still see articles from 2017 coming up as more popular than newer, better treatments of the same subject. If somebody knows the SEO magic to get around this I'm all ears.


Amazon search clearly does not prioritize exact title matches.


It's the "healthy web" Mozilla^1 and Google keep telling their blog audiences about. :)

1 Accept quid pro quo to send all queries to Google by default

If what these companies were telling their readers was true, i.e., that advertising is "essential" for the web to survive, then how are the sites returned by this search engine for text-heavy websites (which are not discoverable through Google, the default search engine for Chrome, Firefox, etc.) able to remain online? Advertising is essential for the "tech" company middleman business to survive.


I'm not sure I agree with your example. It seems to me it is exactly the same as a "Top ten drinks to drink on a rainy day" list. There are simply too many good albums, and opinions differ, so a top ten would -just like the drinks- end up being a list of the most popular ones, with maybe one the author picks to stir some controversy or discussion. In my opinion the world would be a smarter place if Google ranked all such sites low. Then we might at least get fluff like "Top ten prog rock albums if you love X, hate Y and listen to Z when no one is around" instead.


Google won't rank them low because they actually do serve an important purpose. They're there for people who don't really know what they want specifically, they're looking for an overview. A top 10 gives a digestible overview on some topic, which helps the searcher narrow down what they really want.

A "Top 10 albums of all time" post is actually better off going through 10 genres of popular music from the past 50 years and picking the top album (plus mentioning some other top albums in the genre) for each one.

That gives the user the overview they're probably looking for, whether those are the top 10 albums of all time or not. It's a case of what the user searched for vs what they actually really want.


"The best minds of my generation are thinking about how to make people click ads"


It's also possible that it's the other way around: a certain "common denominator" + algorithms that chase broad engagement = mediocre results.

The real trick would be some kind of engine that can aim just above where the user's at.


So did Tim Berners-Lee. He was vehemently opposed to people shoehorning images into the WWW, because he didn't want it to turn into the equivalent of magazines, which, I believe, he also thought were making society dumber.

Appropriately enough, I couldn't find a good quote to verify that since Google is only giving me newspapers and magazines talking about Sir Tim in the context of current events. I do believe it's in his book "Weaving the Web" though.


> I genuinely believe terrible results like this are making society dumber.

You have the causality reversed. Google results reflect the fact that society is dumb.


Google results reflect the fact that educating and informing people has low profit margins.


Or the distribution of people now online better reflects stupidity in the general population.


What I really want is a true AI to search through all that and figure out the useful truth. I don't know how to do this (and of course whoever writes the AI needs to be unbiased...)


>whoever writes the AI needs to be unbiased...)

I'm not sure the idea of a sentient being not having a bias is meaningful. Reality, once you get past the trivial bits, is subjective.


Isn't there a fundamental ML postulate that learning without bias is impossible?

Maybe not the same kind of bias we think of in terms of politics and such, but I wonder if there's a connection.


I didn't say the AI should be unbiased, just whoever writes it.

I want an AI that is biased to the truth when there is an objective one, and to my tastes otherwise. (That is, when asked to find a good book it should give me fantasy, even though romance is the most popular genre and so will have better reviews.)


I think that is the goal, it's just what we currently have is an AI that's like a naive child who is easily tricked and distracted by clickbait.


>an AI that's <snip> easily tricked and distracted by clickbait.

So, AIs are actually on par with most adults now? (Sorry)


Cool, it appears that the trend towards JS may be causing self-selection -- if a page has a high amount of JS, it is highly unlikely to contain anything of value.


True. Unfortunately many large corporate websites through which you pay bills, order tickets, etc. are becoming infested with JS widgets and bulky, slow interfaces. These are hard to avoid.


Conversely, there's no software to install. The browser is a platform. You don't have to boot into Windows to pay your bills with ActiveX, for example.


The mostly JS-less web was fine, fast, and reliable 20 years ago and I never had ActiveX.

I hear stories about Flash and ActiveX but I literally never needed these to shop or pay bills online. Payments also didn't require scripts from a dozen domains and four redirects.


Yup, and taking payments online was awful but privacy was more of a thing. In South Korea ActiveX was required until recently. https://www.theregister.com/2020/12/10/south_korea_activex_c...


The platform isn't the problem. The problem is with the amount of code that does something other than letting you "pay bills, order tickets, etc.".


Huh. A weighted algorithm, somewhere between Google and the one linked, where you could subtract points from sites by the amount of JavaScript, might be interesting.


If one could create a metric of ad-to-content ratio from the JS used, I would guess that would be a nice differentiator too.
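
A back-of-the-envelope sketch of how such a weighting might work (every name and threshold here is invented purely for illustration):

  def adjusted_score(base_score, js_bytes, ad_requests, content_bytes):
      # Hypothetical re-ranking: start from a conventional relevance
      # score, then subtract penalties for script weight and ad density.
      js_penalty = min(js_bytes / 1_000_000, 5.0)  # cap the JS hit
      ad_ratio = ad_requests / max(content_bytes / 10_000, 1)
      return base_score - js_penalty - 2.0 * ad_ratio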


Browsers should be cherry picking the most compelling things that people accomplish with complex code and supporting them as a native feature. Maybe the Browser Wars aren’t keeping up anymore.


Was that ever in doubt?


However, when searching for "haskell type inference algorithm" I get completely useless results.


That query is too long apparently. But if you shorten to "haskell type inference", I think it delivers on its promise:

> If you are looking for fact, this is almost certainly the wrong tool. If you are looking for serendipity, you're on the right track. When was the last time you just stumbled onto something interesting, by the way?


The search engine doesn't do any type of re-ordering or synonym stuff; it only tries to construct different N-grams from the search query.

So compare, for example, "SDL tutorial" with "SDL tutorials". On Google you'd get the same stuff; this search engine, for better or worse, doesn't.

This is a design decision, for now anyway, mostly because I'm incredibly annoyed when algorithms are second-guessing me. On the other hand, it does mean you sometimes have to try different searches to get relevant results.
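
Roughly, a simplified sketch of the idea (illustrative Python, not the actual implementation):

  def query_ngrams(query, max_n=3):
      # Build every contiguous N-gram from the query terms, longest
      # first, so "SDL tutorial" tries the exact phrase before falling
      # back to the individual words. No stemming or synonyms, so
      # "SDL tutorials" yields different grams.
      terms = query.lower().split()
      grams = []
      for n in range(min(max_n, len(terms)), 0, -1):
          for i in range(len(terms) - n + 1):
              grams.append(" ".join(terms[i:i + n]))
      return grams  # e.g. ["sdl tutorial", "sdl", "tutorial"]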


I like this design decision. It pays you back for choosing your search terms carefully.


I’m not against a stemmer, actually, just against the aggressive concordances (?) that Google now employs, like when it shows me X in Banach spaces (the classical, textbook case) when I’m specifically searching for X in Fréchet spaces (the generalization I want to find but am not sure exists); of course Banach spaces and Fréchet spaces are almost exclusively encountered in the same context, but it doesn’t mean that one is a popular typo for the other! (The relative rarity of both of these in the corpus probably doesn’t help. The farcical case is BRST, or Becchi-Rouet-Stora-Tyutin, in physics, as it is literally a single key away from “best” and thus almost impossible to search for.)

On the other hand, Google’s unawareness of (extensive and ubiquitous) Russian noun morphology is essentially what allowed Yandex to exist: both 2011 Yandex and 2021 Google are much more helpful for Russian than 2011 Google. I suspect (but have not checked) that the engine under discussion is utterly unusable for it. English (along with other Germanic and Romance languages to a lesser extent) is quite unusual in being meaningfully searchable without any understanding of morphology, globally speaking.


I thought you could fix that by enclosing "BRST" in quotes, but apparently not. DuckDuckGo (which uses Google) returns a couple of results that do contain "BRST" in a medical context, but most results don't contain this string at all. What's going on?


I’m not certain what DDG actually uses (wasn’t it Bing?), but in my experience from the last couple of months it ignores quotes substantially more eagerly than Google does. For this particular term, a little bit of domain knowledge helps: even without quotes, brst becchi, brst formalism, brst quantization or perhaps bv brst will get you reasonable results. (I could swear Google corrected brst quantization to best quantization a year ago, but apparently not anymore.) Searching for stuff in the context of BRST is still somewhat unpleasant, though.

I... don’t think anything particularly surprising is happening here, except for quotes being apparently ignored? I’ve had it explained to me that a rare word is essentially indistinguishable from a popular misspelling by NLP techniques as they currently exist, except by feeding the machine a massive dictionary (and perhaps not even then). BRST is a thing that you essentially can’t even define satisfactorily without at the very least four years of university-level physics (going by the conventional broad approach—the most direct possible road can of course be shorter if not necessarily more illuminating). “Best” is a very popular word both generally and in searches, and the R key is next to E on a Latin keyboard. If you are a perfect probabilistic reasoner with only these facts for context (and especially if you ignore case), I can very well believe that your best possible course of action is to assume a typo.

How to permit overriding that decision (and indeed how to recognize you’ve actually made one worth worrying about without massive human input—e.g. Russian adjectives can have more than 20 distinct forms, can be made up on the spot by following productive word-formation processes, and you don’t want to learn all of the world’s languages!) is simply a very difficult problem for what is probably a marginal benefit in the grand scheme of things.

I just dislike hitting these margins so much.


It would not be a difficult problem if they allowed the " " operator to work as they claim it does, or revive the + operator.


In English, maybe; in Russian, I frequently find myself reaching for the nonexistent “morphology but not synonyms” operator (as the same noun phrase can take a different form depending on whether it is the subject or the object of a verb, or even on which verb it is the object of); even German should have the same problem AFAIU, if a bit milder. I don’t dare think about how speakers of agglutinative languages (Finnish, Turkish, Malayalam) suffer.

(DDG docs do say it supports +... and even +"...", but I can’t seem to get them to do what I want.)


Ah, OK. I don’t know anything about Russian. This is a hard problem. I think the solution is something like what you suggest: more operators allowing different transformations. Even in English, I would like a "you may pluralize but nothing else" operator.


Well it’s not that alien, it (along with the other Eastern Slavic languages, Ukrainian and Belarusian) is mostly a run-of-the-mill European language (unlike Finnish, Estonian or Hungarian) except it didn’t lose the Indo-European noun case system like most but instead developed even more cases. That is, where English or French would differentiate the roles of different arguments of a verb by prepositions or implicitly by position, Russian (like German and Latin) has a special axis of noun forms called “case” which it uses for that (and also prepositions, which now require a certain case as well—a noun form can’t not have a case like it can’t not have a number).

There are six of them (nominative [subject], genitive [belonging, part, absence, “of”], dative [indirect object, recipient, “to”], accusative [direct object], instrumental [device, means, “by”], prepositional [what the hell even is this]), so you have (cases) × (numbers) = 6 × 2 = 12 noun forms, and adjectives agree in number and gender with their noun, but (unlike Romance languages) plurals don’t have gender, so you have (cases) × (numbers and genders) = 6 × (3 + 1) = 24 adjective forms.

None of this would be particularly problematic, except these forms work like French or Spanish verbs: they are synthetic (case, number and gender are all a single fused ending, not orthogonal ones) and highly convoluted with a lot of irregularities. And nouns and adjectives are usually more important for a web search than verbs.


> BRST, or Becchi-Rouet-Stora-Tyutin is literally a single key away from “best” and thus almost impossible to search for.

Hmm I seem to be getting only relevant results, no "best", not sure what you mean. Are you not doing verbatim search?

https://www.google.com/search?q=brst&tbs=li:1


English is more the outlier in regard to Germanic languages, try German or Finnish, with their wonderful compounds :)

https://e.humanities.uva.nl/publications/2004/kamp_lang04.pd...


Well yeah, English is kind of weird, but Finnish isn’t a Germanic language at all? It’s not even Indo-European, so even Hindi is ostensibly closer to English than Finnish. I understand Standard German (along with Icelandic) is itself a bit atypical in that it hasn’t lost its cases when most other Germanic languages did.

Re compounds, I expected they would be more or less easy to deal with by relatively dumb splitting, similar to greedy solutions to the “no spaces” problem of Chinese and Japanese, and your link seems to bear that out. But yeah, cheers to more language-specific stuff in your indexing. /s


Gaaah, brain fart - you're right, of course, dunno why I included it.


Maybe list the synonyms under the query, so it's easier to try different formulations.


Oh, this sounds like it could be a really cool idea! This way it could also subtly teach users that the engine doesn't do automatic synonym translation, so it's worth experimenting; it's also kind of like offering a synonyms feature while still keeping the user in full control.


Don't change it. It's good this way.


It could simply become an option.


Since it does not use synonyms, it looks like it is unable to answer "what's that thing called"-style queries.


It would be nice if we could pipe search engines.


Definitely; we could create a meta search engine that queries them all, in desktop application format.

Let's name it after a famous old scientist, and maybe add the year to prove it's modern: Galileo 2021.


Meta search engines leave a bad taste in everyone's mouth because they've always failed. Here is why:

https://en.wikipedia.org/wiki/Arrow%27s_impossibility_theore...

You can't combine a few different ranked lists and expect to get results better than any of the original ranked lists.


> You can't combine a few different ranked lists and expect to get results better than any of the original ranked lists.

I am skeptical of this application of the theorem. Here is my proposal:

Take the top 10 Google and Bing results. If the top result from Bing is in the top 10 from Google, display the Google results. If the top result from Bing is not in the top 10 from Google, place it at the 10th position. You'd have an algorithm that ties with Google, say, 98% of the time, beats it, say, 1.2% of the time, and loses 0.8% of the time.
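
The rule is simple enough to write down directly (a sketch; the result lists are stand-ins for whatever the engines' APIs actually return):

  def merge(google_top10, bing_top10):
      # Keep Google's ranking, but if Bing's best result is missing
      # from it, splice that one URL in at position 10.
      merged = list(google_top10)
      if bing_top10 and bing_top10[0] not in merged:
          if len(merged) >= 10:
              merged[9] = bing_top10[0]
          else:
              merged.append(bing_top10[0])
      return merged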


Right. Arrow's theorem just says it's impossible to do it in all cases. It's still quite possible to get an improvement in a large proportion of cases, as you're proposing.


I've had jobs tuning up the relevance of search engines with methods like

https://ccc.inaoep.mx/~villasen/bib/AN%20OVERVIEW%20OF%20EVA...

and the first conclusion is "something that you think will improve relevance probably won't"; the TREC conference went for about five years before making the first real discovery

https://en.wikipedia.org/wiki/Okapi_BM25

It's true that Arrow's Theorem doesn't strictly apply, but thinking about it makes it clear that the aggregation problem is ill-defined and tricky. (E.g. note also that a ranking function for full text search might have a range of 0-1, but the score is not a meaningful number such as a probability estimate that a document is relevant; it just means that a result with a higher score is likely to be more relevant than one with a lower score.)

Another way to think about it is that for any given feature architecture (say "bag of words") there is an (unknown) ideal ranking function.

You might think that a real ranking function is the ideal ranking function plus an error and that averaging several ranking functions would keep the contribution of the ideal ranking function and the errors would average out, but actually the errors are correlated.

In the case of BM25 for instance, it turns out you have to carefully tune between the biases of "long documents get more hits because they have more words in them" and "short documents rank higher because the document vectors are spiky like the query vectors". Until BM25 there wasn't a function that could be tuned up properly, and just averaging several bad functions doesn't solve the real problem.
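
For reference, here is the textbook BM25 term-scoring function, with b as the length-normalization knob that trades off exactly those two biases (a generic formulation, not any particular engine's):

  import math

  def bm25_term(tf, df, doc_len, avg_len, n_docs, k1=1.2, b=0.75):
      # b=1 fully normalizes for document length ("long docs get more
      # hits"); b=0 ignores length ("short, spiky docs rank higher").
      # Tuning b is the balancing act described above.
      idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
      norm = k1 * (1 - b + b * doc_len / avg_len)
      return idf * tf * (k1 + 1) / (tf + norm)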


Arrow's theorem simply doesn't apply here. We don't need our personalized search results to satisfy the majority.


But in both cases you face the problem of aggregating preferences of many into one. In one case you are combining personal preferences in the other case aggregating ‘preferences’ expressed by search engines.


But search engines aren't voting to maximize the chances that their preferred candidate shows up on top. The mixed ranker has no requirement to satisfy Arrow's integrity constraints. It has to satisfy the end user, which is quite possible in theory.

Conditions the mixed ranker doesn't have to satisfy "ranking while also meeting a specified set of criteria: unrestricted domain, non-dictatorship, Pareto efficiency, and independence of irrelevant alternatives"


Sure, but the problem that conventional IR ranking functions are not meaningful other than by ordering leads you to the dismal world of political economy where you can't aggregate people's utility functions. (Thus you can't say anything about inequality, only about Pareto efficiency)

Hypothetically you could treat these functions as meaningful but when you try you find that they aren't very meaningful.

For instance IBM Watson aggregated multiple search sources by converting all the relevance scores to "the probability that this result is relevant".

A conventional search engine will do horribly in that respect: you can fit a logit curve to make a probability estimator, and you might get p=0.7 at the most, and very rarely get that; in fact, you rarely get p>0.5.

If you are combining search results from search engines that use similar approaches you know those p's are not independent so you can't take a large numbers of p=0.7's and turn that into a higher p.

If you are using search engines that use radically different matching strategies (say they return only p=0.99 results with low recall) the Watson approach works, but you need a big team to develop a long tail of matching strategies.

If you had a good p-estimator for search you could do all sorts of things that normal search engines do poorly, such as "get an email when a p>0.5 document is added to the collection."

For now alerting features are either absent or useless and most people have no idea why.
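
For what it's worth, the calibration step can be as simple as fitting a one-feature logistic curve from raw score to relevance probability (a sketch; assumes you have 0/1 human relevance judgments to fit against):

  import math

  def fit_logit(scores, judgments, lr=0.1, epochs=200):
      # p(relevant) = sigmoid(a*score + c), fitted by plain SGD
      # against human 0/1 relevance judgments.
      a, c = 0.0, 0.0
      for _ in range(epochs):
          for s, y in zip(scores, judgments):
              p = 1 / (1 + math.exp(-(a * s + c)))
              a += lr * (y - p) * s
              c += lr * (y - p)
      return lambda s: 1 / (1 + math.exp(-(a * s + c)))

The point above still stands: even a well-fitted curve for a conventional engine tops out at modest probabilities, and correlated estimators can't be combined as if they were independent.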


That's an invalid application of this theorem. (It doesn't necessarily hold)

Suppose there's an unambiguous ranked preference by all people among a set (webpages, ranking). Suppose one search engine ranks correctly the top 5 results and incorrectly the next 5 results, while another ranks incorrectly the top 5 and correctly the next 5.

What can happen is that there may be no universally preferred search engine (likely). In practice, as another commenter noted, you can also have most users prefer a certain combination of results (that's not difficult to imagine, for example by combining the top independent results from different engines).


... is this Galileo 2021 a reference that I am not understanding?


Yup, but so far no one got it.

There was such an app in the early 2000's, before Google went mainstream, and Altavista-like engines were not good: Copernic 2000.

I guess I'm officially old now.


For years I wanted to try Copernic Summarizer. It seemed like it actually worked. Then software that did summaries disappeared, maybe? And about 5 years ago bots on Reddit were doing summaries of news stories (and then links in comments).

This is a pattern I see over and over again: some research group or academics show that something can be done (summaries that make sense and are true summaries, evolutionary-algorithm FPGA programming, real-time gaze prediction, etc.), there are a few published code repos and a bit of news, then 'poof' - nowhere to be seen for 15 years or more.


FWIW, I got the reference. Maybe I'm old too?


I was always a dogpile user :p


Hotbot!


Brave browser currently has "Google fallback" which sometimes mixes in Google search results with Brave's own search engine.

https://search.brave.com/help/google-fallback


Not an app, but probably comes quite close in all other respects: https://metager.org


I need that with a simpler interface, so I call it after a famous detective: Sherlock.


*a magic pop sound is faintly audible as a new side project is appended to several lists* Excellent, thank you!


Likely trademark collision with this: https://www.galileo.usg.edu/


As long as few people use it, it will be great. Rest assured that the moment it becomes popular, the people who want to game it will appear.


This sort of optimization is why simple recipes are typically found at the end of a rambling pointless blog post now.

Still, the best way to break SEO is to have actual competition in the search space. As long as SEO remains focused on Google, there is an opportunity for these companies to thrive by evading SEO brain damage.


That sort of recipe blog hasn't happened just for SEO. It's also a bit of a "two audiences" problem: if you are coming to that food blogger from a search you certainly would prefer the recipe first and then maybe any commentary on it below if the recipe looks good. If you are a regular reader of that food blogger you are probably invested in the stories up top and that parasocial connection and the recipes themselves are sometimes incidental to why you are a regular reader.

You see some of that "two readers" divide sometimes even in classic cookbooks, where "celebrity" chefs of the day might spend much of a cookbook on a long rambling memoir. Admittedly such books were generally well indexed and had tables of contents to jump right to the recipes or particular recipes, but the concept of a "long personal ramble of what these recipes mean to me" is an old one in cookbooks too.


I see your point, but I'd argue you've misidentified the two audiences.

One audience matches your description and is the invested reader. They want that blogger's storytelling. They might make the recipe, but they're a dedicated reader.

The other audience is not the recipe-searcher, but instead Google. Food bloggers know that recipe-searchers are there to drop in, get an ingredient list, and move on. They won't even remember the blog's name. So the site isn't optimized for them. It's optimized for Google.

"Slow the parasitic recipe-searcher down. They're leeches, here for a freebie. Well they'll pay me in Google Rank time blocks."


> Food bloggers know that recipe-searchers are there to drop in, get an ingredient list, and move on.

This is not entirely true, though. If a randomly found recipe turns out particularly good, I'll bookmark the site and try out other dishes. It's a very practical method to find particularly good* recipe collections.

*) In this case "good" means what you need - not just subjectively "tasty", but e.g. low cost, quick to prepare, low calorie or in line with a particular diet and so on.


I'm aware of zero human-behavior-truisms that are "entirely true".


> If you are a regular reader of that food blogger

I think this assumes facts not in evidence. It certainly seems like an overwhelming number of "blogs" are not actual blogs but SEO content farms. There are no regular readers of such things because there are no actual authors, just someone who took a job on Fiverr to spew out some SEO garbage. Old content gets reposted almost verbatim because newer content ranks better according to Google.

The only reason these "blogs" exist is to show ads and hopefully get someone's e-mail (and implied consent) for a marke....newsletter.


I know at least a few that I commonly see in top search results, and I have friends who read them like personalized soap operas where most of the drama revolves around food and family and serving food to family.

It's at least half the business models of Food Network shows: aspirational kitchens and the people that live in them and also sometimes here's their recipes. (The other half being competitions, obviously.) I've got friends that could deliver entire doctoral theses on the Bon Appetit Test Kitchen (and its many YouTube shows and blogs) and the huge soap operatic drama of 2020's events where the entire brand milkshake ducked itself; falling into people's hearts as "feel good" entertainment early in 2020/the pandemic and then exploding very dramatically with revelations and betrayals that Fall.

Which isn't to say that there aren't garbage SEO farms out there in the food blogging space as well, but a lot of the big ones people commonly complain about seeing in google's results do have regular fans/audiences. (ETA: And many of the smaller blogs want to have regular fans/audiences. It's an active influencer/"content creator" space with relatively low barrier to entry that people love. Everyone's family loves food, it's a part of the human condition.)


I've basically never been taken to a recipe without a rambling preamble from Google. While food blogs may serve two audiences, a long introduction seems to be a requirement to appear in the top Google search results.


Personally, I think that has a lot more to do with the fact that Google killed the recipe databases. There used to be a few startups that tried to be recipe aggregators with advertising-based business models, which would show recipes and then link to source blogs and/or cookbooks. In the brief period where they existed, Google scraped them entirely, showed entire recipes on search results, and ate their ad revenue out from under them.


Such databases would get battered by demands to remove content these days, if not already back then. No one wants a database listing their stuff for ad revenue like that, because many visitors wouldn't follow the links to see their adverts or be subject to their tracking.

A couple of browser add-ons specifically geared around trimming recipe pages down have been taken down due to similar complaints.


That is a really bad thing by Google. Their core business is not recipes.


Their core business is making money from other people’s content, no matter what it is.


Their core business is advertising and they have always been in a direct conflict-of-interest by competing with content sites for ad revenue buys.


That's why I use Saffron [1]; it magically converts those sites into a page in my recipe book. I found it when the developer commented here on HN. Also, a lot of cooking websites have started to add a link with "jump to recipe" functionality, allowing you to skip all the crap.

[1] https://www.mysaffronapp.com/


There's also https://based.cooking.


I've noticed this pattern start to pop up elsewhere. I've started to train my skimming skills, skipping a paragraph or two at a time to get past the fluff.

Like an article about some current event will undoubtedly begin with "when I was traveling ten years ago...".


It's also because that's a way of trying to copyright protect recipes, which are normally not copyright protected.

> “Mere listings of ingredients as in recipes, formulas, compounds, or prescriptions are not subject to copyright protection. However, when a recipe or formula is accompanied by substantial literary expression in the form of an explanation or directions, or when there is a combination of recipes, as in a cookbook, there may be a basis for copyright protection.”


But that copyright protection only extends to the literary expression. The recipe itself is still not covered by copyright, even if accompanied by an essay.


That's not really for SEO, which favors readily accessible information.

That's ads. When mobile users have to scroll past 10 ads, they'll click on some of them and make the blog money.


Searching for ‘chocolate’ on this search engine turned up a surprisingly large number of chocolate-based recipes.


>> This sort of optimization is why simple recipes are typically found at the end of a rambling pointless blog post now.

I continue to be curious about this kind of complaint. If all you want is the recipe, without any of the fluff, why would you click on a link to a blog rather than on a link to a recipe aggregator?

Foodie blogs exist specifically for the people who want a foodie discussion and not just an ingredient list.

Is it because blogs tend to have better recipes overall? In that case, isn't there a bit of entitlement involved in asking that the author self-sacrificingly provides only the information that you want, without taking care of their own needs and wants, also?


I think the complaint is that those blogs rank higher than nuts-and-bolts recipes now. It wasn't that way a few years ago. Yes, scrolling down the results to Food Network or Martha Stewart or whatever is possible, as is going directly to those sites and using their site search, but it's noticeable and annoying.


Not my experience. For a very quick test, I searched DDG for "omelette recipe", "carbonara recipe" and "peking duck recipe" (just to spice it up a bit) and all my top results are aggregators. Even "avgolemono recipe" (which I'd think is very specialised) is aggregators on top.

To be honest, I don't follow recipes when I cook unless it's a dish I've never had before. At that point what I want is to understand the point of the dish. A list of ingredients and preparation instructions don't tell me what it's supposed to taste and smell like. The foodie blogs at least try to create a certain... feeling of place, I suppose, some kind of impression that guides you when you cook. I wouldn't say it always works but I appreciate the effort.

My real complaint with recipe writers is that they know how to cook one or two dishes well and they crib the rest off each other so even with all the information they provide, you still can't reliably cook a good meal from a recipe unless you've had the dish before. But that's my personal opinion.


Because when you search for a recipe you get the link to the blog, not the aggregator.


It's the same thing that people always complain about: this thing is not in a format that I like, so it must not be what anyone likes.

If you want JUST recipes, pay money instead of just randomly googling around. America's Test Kitchen has a billion vetted, really good recipes. That solves that problem.


I don't think the existing media-heavy websites are gaming Google to rank higher. It's that Google itself prefers media-heavy content; they don't have to "game" anything.

I also think a search engine like this would be quite hard to game. An ML-based classifier trained on thousands of text-heavy and media-heavy screenshots should be quite robust and, I think, very hard to evade. The "game" would become more about how to identify the crawler so you can serve it a high-ranking page while serving crap to the real users, and that seems fairly easy to defeat if the search engine does a second pass using residential proxies and standard browser user agents to detect this behavior (it could also threaten huge penalties, like the entire domain being banned for a month, to deter attempts at this).
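
As a toy illustration of the kind of classifier meant here (hypothetical PyTorch; input is a batch of page screenshots, output is the probability each page is text-heavy):

  import torch.nn as nn

  classifier = nn.Sequential(
      nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
      nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),
      nn.AdaptiveAvgPool2d(1), nn.Flatten(),
      nn.Linear(32, 1), nn.Sigmoid(),  # p(text-heavy)
  )
  # classifier(screenshots) on an (N, 3, H, W) batch, trained with
  # binary cross-entropy against text-heavy/media-heavy labels.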


With the advances in machine text generation that looks right but isn't quite accurate (aka GPT-3), it seems like this would be easily gamed (given access to GPT-3). Even without GPT-3, if the content being prioritized is mere text, I'm sure that for a pile of money I could generate something that looks like Wikipedia, in the sense that it's a giant pile of mostly text, but that would make zero sense to a human reader. (Building an SEO farm to boost the ranking of not-Wikipedia is left as an exercise for the reader.)


Specialization is mostly a problem in monocultures.

If you almost only plant wheat, you are going to end up with one hell of a pest problem.

If you almost only have Windows XP, you are going to have one hell of a virus problem.

If you almost only have PageRank-style search engines (or just the one), you are going to have one hell of a content spam problem.

Even though they have some pretty dodgy incentives, I don't think Google suffers quality problems because they are evil; I think ultimately they suffer because they're so dominant. Whatever they do, the spammers adapt almost instantly.

A diverse ecosystem on the other hand limits the viability of specialization by its very nature. If one actor is attacked, it shrinks and that reduces the opportunity for attacking it.


If there were a wider variety of popular search engines, with different ranking criteria, would sites begin to move away from gaming the system? Surely it would be too hard to game more than one search engine at a time?


It would come down to numbers anyway: sites would optimize for whichever engine sends the most traffic. A/B testing is already in place and doesn't care where a visitor comes from, just which variant does better.


There should be some perfect balance where this search engine is N% as popular as Google, where Google soaks up all of the gamifiers, but this search engine is still popular enough to derive revenue and do ML and other search-engine-useful stuff.


> the people who want to game it will appear.

So just add human review to the mix: if a site is obviously trying to game the system (listicles, SEO spam, etc.), just drop and ban it from the search index.


Congratulations, you've just invented negative SEO.


>followed by a listicle "8 Reasons Why Rome Fell"

But aren't you curious about the 7th reason? It will surprise you!


You won't believe how Claudius looks today!


Doctors HATE him!!!


Search engines whose revenue is based on advertising will ultimately be tuned to steer you into the ad food chain. All the incentives are aligned towards, and all the metrics ultimately in service of, profit for advertisers. Not in the 99% of people who can be convinced to consume something by ads? Welp, screw you.


Search engines should be something you pay for. Surely search engine powerusers can afford to pay for such a service. If Google makes $1 per user per month or something, that's not too high a bar to get over.


Search engines should be like libraries. At least some tiny sliver of the billions we spend on education and research should go to, you know, actually organizing the world's information and making it universally available.


I see another issue here: companies like Google prioritize information to 1) keep their users and 2) maximize their profit.

If you move data organization to another type of organization (non-profit, state, universities - private or public), then the question of data prioritization becomes highly political. What should be exposed? What should not? What to put first? ...

It is already, but to a smaller extent, since money-making companies have little interest in the meaning of data and high interest in the commercial value of their users.


In which case, consider paying for something like Infinity: https://infinitysearch.co/


The theoretical cap for this, if you include every human being on planet Earth, is $7 billion/month. This translates into $84 billion in annual revenue.

Google's revenue last year was $146 billion, and it operates nowhere near the theoretical maximum. Most of that revenue is advertising.


The Wikipedia link at the top is always given. It would maybe be good to make it a little clearer that it's not one of the true results.


I think this is just because of the terms you have searched. In my test searches Wikipedia has not come up once in first position (I think the highest was 3rd in the list).

Here's what I've tried with a few variations: golang generics proposal, machine learning transformer, covid hospitalization germany

[edit] formatting


I think maybe it's a special insert at the top, but only if a Wikipedia page is found that matches your search term? I'm not sure now though.


Imagine if you were looking for the movie.


Then you'd use a different search engine. Why does everything have to be a Swiss Army knife?


Or you could just search for 'rome movie'. Though for more complex disambiguation you would need to resort to, e.g. schema.org descriptions (which are supported by most search engines, and the foundation for most "smart" search result snippets).


That's a fair point. This engine would be useful if you need grep over the internet (but without regexes), i.e. when you want to find exact phrases. But that's a relatively narrow use case.


I tend to prefer Wikipedia for movies. The exception is actor headshots if I'm trying to identify someone, which Wikipedia lacks for licensing reasons, but otherwise Wikipedia tends to be better than IMDB for most needs. Wikipedia has an IMDB link on every article anyway.

Another need I guess might be reviews, for which RT or MC are better than IMDB: not sure if either of those two will fare better than IMDB in this search engine but again Wiki has links out (in addition to good reception summaries)


For me, imdb was much better when they had user comments/discussion.

I never even posted on it myself, but browsing the discussions one could learn all sorts of trivia, inside info, speculation, etc about each movie.

Since they (inexplicably) killed that feature, I rarely even visit anymore. You're right, for many purposes Wikipedia is better, especially for TV series episode lists with summaries.


IMDB management thought it was their brilliant editorial work that drew people to their site. Morons. It was the comments all along. Of course they also believed they could create gravity-free zones by sheer force of executive will (and maybe still do).


Especially for old and lesser-known movies, the discussion board for the movie was a brilliant addition that could give the movie an extra dimension. Context is very important in order to understand, and ultimately enjoy, something.

I think they removed it in part because new movies, like Star Wars and superhero movies, had a lot of negative activity.


I find IMDb to be more convenient than RT/MC/Wikipedia for finding release dates of movies - nearly every other website lists only the American release date, maybe one or two others if the movie was disproportionately popular in certain regions.


Imagine including the search term "movie".


That doesn't do anything useful.


?q=imdb.com:fall of the roman empire


!imdb


Wow I used "personality test" and actually got useful articles about personality theory. I'll actually use this!


I think it's a case where systems diversity can be an advantage. Much like how most malware was historically written for Windows and could be avoided by using Linux, the low-quality search engine bait is created for Google and can be avoided by using a different style of search engine.


No one mentioned the "bonus" audio in the page source: https://www.youtube.com/watch?v=7fCifJR6LAY


;-)


Interesting choice of search topic. Are you trying to make an additional point?


I tried some queries for Harry Potter fanfictions, and the results were pretty much completely unrelated. There weren’t that many results, either.


I'm curious what you searched for.

https://search.marginalia.nu/search?query=harry+potter+fanfi...

This seems to return a pretty decent number of sites relating to that (as well as some sites not relating to that).

The search engine isn't always great at knowing what a page is about, unfortunately.

This seemed to return mostly relevant results

https://search.marginalia.nu/search?query=%22harry+potter%22...


Yes, shorter queries return more relevant results. I think this was the first query that came to my mind:

https://search.marginalia.nu/search?query=Best+%22harry+pott...


Yeah, that's just not a type of query my search engine is particularly good at. It's pretty dumb, and just tries to match as much of the webpage against the query as it can.

This used to be how all search engines worked, but I guess people have been taught by Google that they should ask questions now, instead of searching for terms.

I wonder how I can guide people to make more suitable queries. Maybe I should just make it look less like google.


If this search engine ever takes off, the listicle writers will just start optimizing for it too, right?


Mission accomplished, then.


If the goal was to remove modern web design, ok sure mission accomplished.

If your goal was to create a search engine that ignored listicles and other fluff and instead got you meatier results like "academic talks" and such, then no.


When a measure becomes a target, it ceases to be a good measure.

https://en.wikipedia.org/wiki/Goodhart%27s_law


I had the exact opposite experience. I searched the site for "java", got a Wikipedia link first (for the island, not the programming language), the 2nd result was a random JEP page, and all the rest of the results were random tidbits about Java (e.g. "XZ compression algorithm in Java"). I didn't get any high-level results pointing to an overview of the language, getting-started guides, etc.


You need to use some old school search techniques and search for “Java overview”


I'm not sure that's a bad thing.


Well, they're results for Java-related items...

What kind of links were you expecting to find?

What kind of links where you expecting to find?


I did a search for "George Washington"

First result after Wikipedia:

"Radiophone Transmitter on the U.S.S. George Washington (1920)

In 1906, Reginald Fessenden contracted with General Electric to build the first alternator transmitter. G.E. continued to perfect alternator transmitter design, and at the time of this report, the Navy was operating one of G.E.'s 200 kilowatt alternators http://earlyradiohistory.us/1919wsh.htm "

Another result in the first few:

" - VANDERBILT, GEORGE WASHINGTON

PH: (800) ###-#233 FX: (#03) 641-5###. https://www.ScottWinslow.com/manufacturer/VANDERBILT_GEORGE_... "

And just below that terrible result:

"I Looked and I Listened -- George Washington Hill extract (1954)

Although the events described in this account are undated, they appear to have occurred in late 1928. I Looked and I Listened, Ben Gross, 1954, pages 104-105: Programs such as these called for the expenditure of larger sums than NBC had anticipated. It be http://earlyradiohistory.us/1954ayl2.htm "

Dramatically worse than Google.

---

Ok, how about a search for "Rome" then? Surely it'll pull some great text results for the city or the ancient empire.

First result after Wikipedia:

"Home | Rome Daily Sentinel

Reliable Community News for Oneida, Madison and Lewis County http://romesentinel.com/"

The fourth result for searching "Rome":

"Glenn's Pens - Stores of Note

Glenn's Pens, web site about pens, inks, stores, companies - the pleasure of owning and using a pen of choice. Direcdtory of pen stores in Europe. http://www.marcuslink.com/pens/storesofnote/roma.html"

Again, dramatically worse than Google.

---

Ok, how about if I search for "British"?

First result after Wikipedia:

"BRITISH MINING DATABASE

British_Mining_Database http://www.users.globalnet.co.uk/~lizcolin/bmd.htm "

And after that:

"British Virgin Islands

Many of these photos were taken on board the Spirit of Massachusetts. The sailing trip was organized by Toto Tours. Images Copyright © Lowell Greenberg Home Up Spring Quail Gardens Forest Home Lake Hodges Cape Falcon Cape Lookout, Oregon Wahkeena http://www.earthrenewal.org/british_virgin_islands2.htm"

Again, far off the mark and dramatically worse than Google.

I like the idea of Google having lots of search competition, this isn't there yet (and I wouldn't expect it to be). I don't think overhyping its results does it any favors.


What were you expecting to see for British? There must be millions of pages containing that term. Anyway the first screenful from Google is unadulterated crap, advertising mixed with the usual trivia questions.

If you are going to claim something is wide of the mark then you really ought to tell us at least roughly where the mark is.


This is not a Google competitor, it's a different type of search engine with different goals.

> If you are looking for fact, this is almost certainly the wrong tool. If you are looking for serendipity, you're on the right track. When was the last time you just stumbled onto something interesting, by the way?


I checked the results of the same query and they seem fine. Lots of speeches and articles about George Washington the US president. There's even his beer recipe.

As for the results you linked, it's part of the zeitgeist to list other entities sharing the same name. Sure, they could use some subtle changes in ranking, but overall the returned links satisfy my curiosity.


[flagged]


The project explicitly bills itself as a "search engine", not an "interesting and unexpected material surfacer". Moreover, projecting emotions like "angry" onto a comment in order to discredit the content of the comment (hey! is that an ad-hominem?) is just about exactly the opposite of the discussions that the HN mods are trying to curate, and the discussions that I like to see here.


If you click through to the About page, I think you'll see that "interesting and unexpected material surfacer" is a fairly apt description of the project.


I think in fairness that when "interesting and unexpected material surfacer" is merely a euphemism for "we didn't bother indexing the things you might actually be looking for", a degree of scepticism isn't unwarranted.

(Source: I looked up several Irish politicians because I run an all-text website containing every single word that they say in parliament. I got nothing of use, or even of interest, for anything.)


In the early days of google, I found what I was looking for on page 5+. On the way, I’d discover many interesting things I didn’t even know I was looking for, often completely unrelated to what I was searching for.


I miss those old days of even being permitted to go many pages in.


And now Google hides that more than one page even exists, as they populate their first page with buttons to ask similar questions and go to the first page of THOSE results.


> Hobby project leads angry person to interesting and unexpected material; angry person remains angry.

Not angry in the least. I'm thrilled someone is working on a search competitor to Google.

I understand you're attempting to dismiss my pointing out the bad results by calling me angry though. You're focusing your content on me personally, instead of what I pointed out.

The parent was far overhyping the results in a way that was very misleading (look, it's better than Google!). I tried various searches, they were not great results. The parent was very clearly implying something a lot better than that by what they said. The product isn't close to being at that level at this point, overhyping it to such an absurd degree isn't reasonable or fair to the person that is working on it.

I would specifically suggest people not compare it to Google. Let it be its own thing, at least for a good while. Google (Alphabet) is a trillion dollar company. Don't press the expectations so far and stage it to compete with Google at this point. I wouldn't even reference Google in relation to this search engine, let it be its own thing and find its own mindshare.


> I'm thrilled someone is working on a search competitor to Google.

Except the author goes to quite some lengths to explain that his search engine is not a competitor to Google, and is in fact exactly the opposite of Google in many ways: https://memex.marginalia.nu/projects/edge/about.gmi


Yeah, Google tends to send a lot of junk back.


Yeah so this is my project. It's very much a work in progress, but occasionally I think it works remarkably well for something I cobbled together alone out of consumer hardware and home-made code :-)


It's very rare that I see a project on HN I can see myself using. This is one. Like others have said, the results can be a little rough. But they're rough in a way I think is much more manageable than the idiosyncrasies of more 'clever' search engines.


I think you need to approach it more like grep than google. It's a forgotten art, dealing with this type of dumb search engine.

Like if you search for "How do I make a steak", you aren't going to get very good results. But a better query is "Steak Recipe", as that is at least a conceivable H1-tag.


This is exactly how I prefer to use my search engines.


I searched like this all my life and always got expected results.

But just a week ago I found out that these "how", "what" questions give better and faster results on Google.


That switch happened some years ago. I've been unlearning and relearning how to use google for what feels like at least three or four years now.

The main pain-point, though, is that a lot of long-tail searches you could've used to find different results in years past, now seem to funnel you to the same set of results based on your apparent intent. At least, it has felt that way -- I'm not entirely sure how the modern google algorithm works.


I realized this a few years ago when I observed my wife find things faster on Google than me.

I appreciate that it is easier for newcomers, but after all these years I still hate it personally, especially that they cannot even avoid meddling with my queries when I try to accept the new system and use the verbatim option.


Try your old Google Fu skills on DuckDuckGo (or Bing I guess). I've found it to have good results anyway


> I think you need to approach it more like grep than google. It's a forgotten art

A search engine that accepted regex as the search parameter would be amazing.

I actually used this method as a field filter for a bunch of simple internal tools to search for info. Originally people were asking for individual search capabilities, but I didn't want it to become a giant project with me as the implementer of everyone's unique search-capability feature request. So I just gave them regex, encoded the inputs into the URL query string so they can save searches, and gave 'em a bunch of examples to get going. Now people are slowly learning regex and coming up with their own "new features" :P

But this made sense because it's a relatively small amount of data, so small that it's searched in the front end, which is why it's more of a filter... I don't think pure regex would scale when used as a query on a massive DB; it would need some kind of hierarchy still, to only bother parsing a subset of relevant text... unless there is some clever functional regex-caching algorithm that can be used.
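For illustration, a minimal sketch of that kind of regex filter (in Java here, though the real tool filters in the front end; the Entry record and its fields are made up):

  import java.util.List;
  import java.util.regex.Pattern;
  import java.util.stream.Collectors;

  // Hypothetical record standing in for whatever rows the internal tool serves.
  record Entry(String name, String description) {}

  static List<Entry> filter(List<Entry> entries, String userRegex) {
      // Compile once per query; case-insensitive is usually what users expect.
      Pattern p = Pattern.compile(userRegex, Pattern.CASE_INSENSITIVE);
      return entries.stream()
              .filter(e -> p.matcher(e.name() + " " + e.description()).find())
              .collect(Collectors.toList());
  }

The saved-search part is then just a matter of carrying userRegex in the URL query string.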


So, you are re-implementing Altavista, Lycos and other old search engines.

They used the naive approach: you searched for "steak", and they would bring up the pages that included the word "steak".

The problem is that people could fool these engines by adding a long sequence like "steak, steak, steak, steak, steak, steak" to their site -- to pretend that they were the most authoritative page about steaks.

Google's big innovation was to count the referrers -- how many pages used the word "steak" to link to that particular page.

The rest is history.
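In its crudest form, the idea is just counting inbound links. A toy sketch, assuming the link graph is already in hand (real PageRank also weights each referrer by its own score and normalizes by out-degree):

  import java.util.*;

  // links.get(page) = the pages that `page` links out to
  static Map<String, Long> inboundCounts(Map<String, List<String>> links) {
      Map<String, Long> counts = new HashMap<>();
      for (List<String> targets : links.values())
          for (String target : targets)
              counts.merge(target, 1L, Long::sum);  // one more referrer for target
      return counts;
  }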


Effective Google search is also history.

I understand they are trying to maximize ad revenue and search does work very well for people who are looking for products or services.

But it no longer works well for finding information that is even slightly obscure.


> The problem is that people could fool these engines by adding a long sequence like "steak, steak, steak, steak, steak, steak" to their site -- to pretend that they were the most authoritative page about steaks.

I don't see a lot of people investing in SEO to boost their Marginalia results.


> Google's big innovation was to count the referrers -- how many pages used the word "steak" to link to that particular page.

Then people fooled Google into showing the White House as top result when searching for "a miserable failure".

At the moment marginalia's approach of sorting pages into quality buckets based on lack of JS seems to be working extremely well, but of course it will be gamed if it gets popular.

However, I'd rather have SEO-crafting concern itself with minimizing JS than with spamming links into every comment field on every blog across the globe ;-)


Love it, kudos! This is great for developers and others who Just Need Answers and not shopping or entertainment.

If you're looking for feedback, both from a UI design and utility standpoint, you might consider "inlining" results from selected sites, e.g. Wikipedia, Stack Exchange, etc. Having worked on search for a long time, inlining (onebox etc.) is a big reason users choose Google, and a big reason challengers fail to get traction. If you're Serious(tm), dig into the publishers' structured-data formats and format those, create a test suite, etc.

A word of caution: if this takes off, as a business it's vulnerable to Google shifting its algorithms slightly to identify the segment of users+queries who prefer these results and give the same results to those queries.

Hope this helps!


If Google starts showing interesting text-heavy links instead of vapid listicles and storefronts, I have accomplished everything I ever could dream of.


Google Info - for when you're looking for information, not shopping advice or lists!


Maybe you're joking, but this is a good idea for a search engine. Better: Credible Info.


Google info? Can you give me a sample query of what you mean?


It was a joke. The joke being that Google should launch at new product called Google Info, that would actually give you information when you search.


Haha, reminds me exactly of this.

https://xkcd.com/810/


You can check the Web Vitals score of Google SERPs using Core SERP Vitals (https://chrome.google.com/webstore/detail/core-serp-vitals/o...) and filter out the worst results.


Thank you for doing this important work.


haha, great answer! thanks for your work on this :)


Very cool project! How many websites do you have in your index? And how did you go about building it?

I've been working on an engine for personal websites, currently trying to build a classifier to extract them from commoncrawl, if you have any general tips on that kind of project they'd be very welcome.


About 21 million, and I'm crawling myself.

Classification is really hard. I'm struggling with it myself, as a lot of pages like privacy policies and change logs turn out to share the shape of a page of text.

I'm thinking of experimenting with ML classifiers, as I do have reasonably good ways of extracting custom datasets. Finding change logs and privacy policies is easy; excluding them is hard.
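Even a crude keyword heuristic can cut the noise down before any ML step; something along these lines (the hint list is invented for illustration):

  import java.util.List;

  // Rough pre-filter: pages whose URL or title advertises boilerplate.
  static final List<String> BOILERPLATE_HINTS = List.of(
      "privacy-policy", "privacy_policy", "changelog", "change-log",
      "terms-of-service", "cookie-policy");

  static boolean looksLikeBoilerplate(String url, String title) {
      String haystack = (url + " " + title).toLowerCase();
      return BOILERPLATE_HINTS.stream().anyMatch(haystack::contains);
  }

Of course this only catches pages that advertise themselves; the ones that don't are exactly the hard residue a real classifier would be for.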


If you're open to sharing your index I could make a classifier for you; I do this for a living. It's more the indexing and search-engine part that has been a problem for me. That's why I'm working from Common Crawl's data.


This is awesome. I've been looking for a long time for a search engine that basically takes everything Google does and does the opposite. Thank you for doing this, I will definitely be bookmarking it.

Is there a way to suggest or add sites? I went looking for woodgears.ca and only got one result. I also think my personal blog would be a good candidate for being indexed here but I couldn't find any results for it.


I would also love to add some sites that I found were missing...

Unfortunately this doesn't seem to be a feature that new search engines are focusing on - Brave Search lacks it too...


This has amazing potential. I'd encourage you to form a non-profit, turn this into something that can last as an organization without becoming what you're trying to avoid becoming. This is a good enough start that I bet you could raise a sizeable startup fund very soon from a combination of crowdfunding and foundation grants—I bet the Sloan Foundation would love this!


This is absolutely wonderful. I am LOVING the results I'm getting back from it: the sort of content-rich sites that have become nigh unreachable using traditional search engines. Thank you for building this!


I love this, and I love (many of) the results so far! What I can't find on the site is detail about what "too many modern web design features" means. Is it just penalizing sites with tons of JavaScript?


Javascript tags are penalized the hardest, but it also takes into consideration density of text per HTML. There's also some adjustments based on text length, which words occur in the page, etc.
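As a very rough sketch of what such a score could look like with an HTML parser like JSoup (the weights here are invented; the real ranking uses more signals than this):

  import org.jsoup.Jsoup;
  import org.jsoup.nodes.Document;

  // Toy quality score: penalize script tags, reward text-per-HTML density.
  static double textiness(String html) {
      Document doc = Jsoup.parse(html);
      int scriptTags = doc.select("script").size();
      double density = (double) doc.text().length() / Math.max(1, html.length());
      return density - 0.1 * scriptTags;  // made-up weights, for illustration
  }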


Which software do you use to index the sites?


I wrote it myself from scratch. I have some metadata in mariadb, but the index is bespoke.

A design sketch of the index is that it uses one file with sorted URL IDs, and one with IDs of N-grams (i.e. words and word-pairs) referring to ranges in the URL file; as well as a dictionary relating words to word-IDs. That's a GNU Trove hash map I modified to use memory-mapped data instead of directly allocated arrays.

So when you search for two words, it translates them into IDs using the special hash map, goes to the words file and finds the least common of the words; starts with that.

Then it goes to the words file and looks up the URL range of the first word.

Then it goes to the words file and looks up the URL range of the second word.

Then it goes through the less common word's range and does a binary search for each of those in the range of the more common word.

Then it grabs the first N results, and translates them into URLs (through mariadb); and that's your search result.

I'm skipping over a few steps, but that's the very crudest of outlines.
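Very roughly, the shape of it is something like this (names invented; the real files are memory-mapped rather than held as in-heap arrays):

  import java.util.Arrays;
  import java.util.Map;

  long[] rangeStart;  // per word-ID: where its range begins in `urls`
  long[] rangeEnd;    // per word-ID: where its range ends
  long[] urls;        // the URLs file: concatenated sorted URL-ID ranges

  Map<String, Integer> dictionary;  // word -> word-ID (the modified Trove map)

  long[] postingsFor(String word) {
      int id = dictionary.get(word);
      return Arrays.copyOfRange(urls, (int) rangeStart[id], (int) rangeEnd[id]);
  }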


Good stuff. I've also been toying with doing some homegrown search engine indexing (as an exercise in scalable systems), and this is a fantastic result and great inspiration.

Definitely want to see more people doing that kind of low-level work instead of falling back to either 'use elasticsearch' or 'you can't, you're not google'.


Well just crunching the numbers should indicate what is possible and what isn't.

For the moment I have just south of 20 million URLs indexed.

1 x 20 million bytes = 20 MB.

10 x 20 million bytes = 200 MB.

100 x 20 million bytes = 2 GB.

1,000 x 20 million bytes = 20 GB.

10,000 x 20 million bytes = 200 GB.

100,000 x 20 million bytes = 2 TB.

1,000,000 x 20 million bytes = 20 TB.

This is still within what consumer hardware can deal with. It's getting expensive, but you don't need a datacenter to store 20 TB worth of data.

How many bytes do you need, per document, for an index? Do you need 1 MB of data to store index information about a page that, in terms of text alone, is perhaps 10 KB?


What crawler are you using and what kind of crawling speeds are you achieving?

How do you rank the results (is it based on content only) or you have external factors too?

What is your personal preferred search option of the 7 and why?

Thanks for making something unique and sorry that despite all the hype this got, you got only $39/month on Patreon. It is telling in a way.


> What crawler are you using and what kind of crawling speeds are you achieving?

Custom crawler, and I seem to get around 100 documents per second at best, maybe closer to 50 on average. It depends a bit on how many crawl-worthy websites it finds, and there are definitely diminishing returns as it goes deeper.

>How do you rank the results (is it based on content only) or you have external factors too?

I rank based on a pretty large number of factors, including incoming links weighted by the "textiness" of the source domain, and similarity to the query.

> What is your personal preferred search option of the 7 and why?

I honestly use Google for a lot. My search engine isn't meant as a replacement, but a complement.

> Thanks for making something unique and sorry that despite all the hype this got, you got only $39/month on Patreon. It is telling in a way.

Are you kidding? I think the Patreon is a resounding success! I'm still a bit stunned. I've gotten more support and praise, not just in terms of money but also emails and comments here than I could have ever dreamed possible.

And this is just the start, too. I only recently got the search engine working this well. I have no doubt it can get much better. The fact that I have 11 people with me on that journey, even if they "just" pay my power bill, that's amazing.

I'm honestly a bit at a loss for words.


You have a great attitude!

And I am not kidding. I think for something that got so much attention on HN, where realistically this kind of product can only exist for now, the 'conversion' rate was very low. Billion-dollar companies were made out of HN threads with a lot less engagement. Makes me wonder: do we really want a search engine like this, or do we just like the idea of it?

And what are the barriers to use something like this? You say yourself that you are using Google most of the time. Is jumping to check results on this engine going to be too much friction for most uses?

Can something like this exist in isolation? What kind of value would it need to provide for users to remember to use it en masse as an additional/primary vertical search, like they do for Amazon?

Just thinking out-loud as I am also interested in the space (through http://teclis.com).


I think in part it may just be because I'm not trying to found a start-up, and I'm not trying to get rich quick. If I were, I would have dealt with this very differently. My integrity and the integrity of this project is far more important than my bank balance. Not everyone feels that way, and I can respect that, but I do.

Ultimately I think running something like this for profit would create really unhealthy incentives to make my search engine worse. Any value it brings, right now, it brings because it isn't trying to cater to every taste and every use case.

I also hate the constant "don't forget to slap the like and subscribe buttons" shout-outs of modern social media, even though I'm aware it is extremely effective. If I went down that route, I would become part of the problem I'm trying to help cure. I do feel the sirens' call though; it's intoxicating getting this sort of praise and attention.

I want this to be a long-term project, not some overnight Cinderella story.

In the end, my search engine is never going to replace google. It isn't trying to, it's trying to complement it. It's barely able now, but hopefully I can make it much better in the months and years to come.


I think it's good not to have to depend on financial compensation for every single thing in your life, if you can be comfortable or do well otherwise.

This allows quite a bit of its own kind of freedom even if maximum financial opportunity is not fully exploited. Perhaps even because you are not grasping for every dollar on the table at all times.

You can do things without having to know if they will pay off, and if it turns out big anyway you can make money as a byproduct of what you do rather than having pure financial pursuit be the root of every goal.


I agree with everything you say. The 'subscribe and like buttons' would not help your conversion with HN readers, on the contrary. Trying to run this for profit also would not help your conversion with this audience.

So given your setup is already ideal for 'conversions' for this population (low profile, high integrity, no BS), I was simply genuinely surprised that only 11 people converted given the enormous visibility/interest this thread had. Hope that makes sense.


I think it simply takes time to build trust. The threshold to sending someone money is high. I probably wouldn't send someone money based on a proof of concept and lofty ambitions alone.

I'd absolutely consider sending someone money if they kept bringing something of value into my life. If I want more people to join the patreon, I'll just have to earn their trust and support.


The day Google first appeared on the wider internet it was excellent, of course, partly because it had no ads.

Another excellent feature was that you would get the same search results no matter who or where you were, for quite some period of calendar time.

If something new did appear, it was likely to be one of the new sites that were popping up all the time, and it was likely to be as worthwhile as its established associates on the front page.

You shouldn't need to crawl nearly as fast if you can compensate by treading more deliberately where others have gone before.


Interesting, in my database (http://root.rupy.se) I have one file per word that contains the ids (long) of the nodes (URLs), so to search many words together I have to go through the first file and one by one see if I find matches in the second.

How does the range binary search work? Does it just prune out the overlaps? How efficient is it, and how much data do you have in there for, say, "hello" and "world"?


I’m not sure how you go from word to url range? Range implies contiguous, but how can you make that happen for a bunch of words without keeping track of a list of urls for each word (or URL ids, the idea is the same)?


The trick is that the list of URLs for each word is already in the URLs file.

The URLs in a range are sorted. A sorted list (or list-range) forms an implicit set-like data structure, where you can do binary searches to test for existence.

Consider a words file with two words, "hello" and "world", corresponding to the ranges (0,3), (3,6). The URLs file contains URLs 1, 5, 7, 2, 5, 8.

The first range corresponds to the URLs 1, 5, 7; and the second 2, 5, 8.

If you search for hello world, it will first pick a range, the range for "hello", let's say (1,5,7); and then do binary searches in the second range -- the range corresponding to "world" -- (2,5,8) to find the overlap.

This seems like it would be very slow, but since you can trivially find the size of the ranges, it's possible to always process them in order of increasing range size. 10 x log(100000) is a lot smaller than 100000 x log(10).


Hm, ok I understand more but how do you perform the "binary search", just loop over the URL ids?

Funny I also selected "hello" and "world" above! Xo

My system is also written in Java btw!

Here are example results of my word search:

http://root.rupy.se/node/data/word/four

http://root.rupy.se/node/data/word/only

etc.


I'll get back to your email in a while, I've got a ton of messages I'm working through.

But yeah, in short pseudocode:

  for url in range-for-"hello":
    if binary-search (range-for-"world", url):
      yield url
I do use streams, but that is the bare essence of it.
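Fleshed out with the concrete numbers from the grandparent comment, a self-contained version (plain arrays instead of streams, purely for illustration):

  import java.util.Arrays;

  long[] urls  = {1, 5, 7, 2, 5, 8};                // the URLs file
  long[] hello = Arrays.copyOfRange(urls, 0, 3);    // range (0,3): 1, 5, 7
  long[] world = Arrays.copyOfRange(urls, 3, 6);    // range (3,6): 2, 5, 8

  for (long url : hello)                            // walk the smaller range
      if (Arrays.binarySearch(world, url) >= 0)     // both ranges are sorted
          System.out.println(url);                  // prints 5, the only overlap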


So every time you insert a new URL for a word you have to update the range for every other single word since the URL file will be shifted?


Are the n-grams always at most n=2 bigrams?


No, I actually count the n-grams as distinct words (up to 4-grams). The main limiter for that is space, so I only extract "canned" n-grams from some tags.

I would first search for the bigram hello_world, which is an O(1) array lookup; and then for documents merely containing the words hello and world (usually not as good a search result), which is the algorithm I'm describing in the parent comment.
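So the lookup order is roughly (a sketch only; dictionary, postings, and intersect stand in for the hypothetical pieces sketched upthread):

  long[] search(String w1, String w2) {
      Integer bigramId = dictionary.get(w1 + "_" + w2);  // e.g. "hello_world"
      if (bigramId != null)
          return postings(bigramId);                     // O(1) array lookup
      // Fall back to intersecting the two single-word ranges.
      return intersect(postings(dictionary.get(w1)),
                       postings(dictionary.get(w2)));
  }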


Makes sense. Every time you insert a new URL for a word you have to update the ranges for every other word since the URL file will be shifted?


It's a great project!


I continue to want a communal set of 50-100 million URLs and data that are "good" (for any value of good, and more complete than Common Crawl), accessible enough to be easy to work with, that can be used to experiment with different web-search tech. There are enough separate problems to tackle in web search that breaking it down would maybe move the needle. We have lots of Kaggle competitions about ranking, but using closed data. What other types of Kaggle projects would help web search?

What can we do to foster a sustainable bazaar of projects to make it easier to build web search engines?


How are you doing the crawling without getting blocked? -- the hardest part.


Not OP, but crawling is easy if you don't try scanning 5+ pages a second. Almost all rate-limiting / heuristic-based 'keep server costs low' systems, including Cloudflare, don't care if you request every page, but they will take action if you do something like burst every page and take up as many server resources as a hundred concurrent users.

Now, that is assuming you aren't on some VPS provider. If you're going to crawl, you'll have the best chance when you use your own IPs on your own ASN, with DNS and reverse DNS set up correctly. This makes it so the IP reputation systems can detect you as a crawler but not one that hammers every site it visits.

Also, I imagine that, for a search engine like this, content isn't expected to change much anyway - so it can take its time, crawling every site only once every month or two, instead of the multiple times a week (or day) that search engines like Google need to keep up with constantly updated content.
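The heart of "don't hammer any one host" is just a per-domain clock; a minimal sketch (the one-second delay is an invented value):

  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;

  // Enforce a minimum pause between requests to the same domain.
  static final long MIN_DELAY_MS = 1_000;
  static final Map<String, Long> lastHit = new ConcurrentHashMap<>();

  static void politeWait(String domain) throws InterruptedException {
      long wait = lastHit.getOrDefault(domain, 0L) + MIN_DELAY_MS
                - System.currentTimeMillis();
      if (wait > 0) Thread.sleep(wait);
      lastHit.put(domain, System.currentTimeMillis());
  }

Interleaving many domains in the crawl frontier then keeps overall throughput up while each individual host only sees an occasional request.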


Pretty neat!!!

You may already be aware of this, but the page doesn't seem to be formatted correctly on mobile. The content shows in a single thin column in the middle.


Hmm, which OS? I only have a single Android phone so I've only fixed the CSS for that.


I was seeing it on Android w/ Firefox. Seems like it's fixed now though. :)


Curious, I haven't touched the stylesheets.


For example Firefox on Android.


Fennec F-Droid on Android 11 has some rendering issues.


Really like it, and love that you've done this yourself.

I'd prefer if it does just one thing and does that really well. Don't waste your time on calculator and conversion functions, or pseudo-natural language queries. There are plenty of good calculator & converter tools and websites, but we all need a really good search engine. I think you'd be better looking at handling synonyms and plurals.

Thanks.


I've mostly added these converters and frills because I'm trying to eat my own dogfood, and google is deeply engrained as my go-to source for unit conversions and back-of-envelope calculations.

Don't worry, this stuff is easy, and doesn't even remotely take away from the work on the harder problems.


Awesome project! How are you able to keep the site running after HN kiss of death? What is your stack, Elasticsearch or something simpler? How did you crawl so many websites for a project this size? Did you use any APIs like DuckDuckGo's or data from other search engines? Are you still incorporating something like PageRank to ensure good results are prioritized or is it just the text-based-ness factor?


> How are you able to keep the site running after HN kiss of death?

I originally targeted a Raspberry Pi4-cluster. It was only able to deal with about 200k pages at that stage, but it did shape the design in a way that makes very thrifty use of the available hardware.

My day job is also developing this sort of high-performance Java application, so I guess that helps.

> What is your stack, Elasticsearch or something simpler?

It's a custom index engine I built for this. I do use mariadb for some ancillary data and to support the crawler, but it's only doing trivial queries.

> How did you crawl so many websites for a project this size?

It's not that hard. It seems like it would be, and there certainly is an insane number of edge cases, but if you just keep tinkering you can easily crawl dozens of pages per second even on modest hardware (distributed across different domains, of course).

> Did you use any APIs like DuckDuckGo's or data from other search engines?

Nope, it's all me.

> Are you still incorporating something like PageRank to ensure good results are prioritized or is it just the text-based-ness factor?

I'm using a somewhat convoluted algorithm that takes into consideration the text-based-ness of the page, but also how many incoming links the domain has; the link count is a weighted value that factors in the text-based-ness of the origin domains.

It would be interesting to try a PageRank-style approach, but my thinking is that because that's the algorithm everyone knows, it's also the algorithm everyone is trying to game.
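The weighted-inlink part might be pictured something like this (names and the textiness scale are invented; it just shows the shape of the idea):

  import java.util.List;
  import java.util.Map;

  // Domain score: each incoming link counts in proportion to how
  // "texty" the linking domain itself is (0..1).
  static double domainScore(String domain,
                            Map<String, List<String>> inlinks,  // target -> sources
                            Map<String, Double> textiness) {
      return inlinks.getOrDefault(domain, List.of()).stream()
              .mapToDouble(src -> textiness.getOrDefault(src, 0.0))
              .sum();
  }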


Thank you so much for creating such a useful search engine!

Is there any way that you can get an HTTP certificate?

I use an old iPhone 4S, and most of the modern web is inaccessible due to TLS. Hacker News and mbasic.facebook are two of the last sites I can use.

Usually text-based sites are more accessible, so this could be really useful to help me continue using my antique devices!


Is there a JSON endpoint? I'd love to make an Emacs bridge :)


Seconded, I’d like to incorporate it into a project of mine.


Love the idea. A little feedback: layout needs tweaking for mobile. FWIW: I'm on mobile Firefox for Android.


I searched Warcraft and got a gold selling/ level boosting site. Some things never change :)


This is really cool! A spacer between the links would help my old eyes; I keep getting lost in which link goes with which description. :-)


Hi,

Interesting idea. I definitely see an overlap with the eReader market and reading text-only content.

How does it work?

Does it ignore pages on which it detects UI or ad frameworks, or any JavaScript code at all?


I can't see the letters in the disturbing white search box. I'm on DuckDuckGo, Brave, Monocles, Jquarks, and SmartCookieWeb on Android.


Nice, what are you using to crawl the web?


It's pretty much all bespoke.

I use external libraries for parsing HTML (JSoup) and robots.txt; but that's about it.


What was the starting site you fed to the crawler to follow the links from to build the index?


Just my (Swedish) personal website. The first iteration of the search engine was probably mainly seeded by these links:

https://www.marginalia.nu/00-l%C3%A4nkar/

But I've since expanded my websites, so now I think these play a decent role in later iterations, although virtually all of them are pages I've found eating my own dogfood:

https://memex.marginalia.nu/links/fragments-old-web.gmi

https://memex.marginalia.nu/links/bookmarks.gmi


Nice project, but have you heard of FrogFind? It also presents lightweight search results.


This is a very cool project! Thank you.


What's the tech stack?


It's custom java code, for the most part. I'm using mariadb for some ancillary information, but the index and the crawler and everything is written from scratch.

It's hosted on my consumer-equipment server (Ryzen 3900X, 128 GB RAM, Optane 900p + a few IronWolf drives), bare bones on Debian.


I love this idea, and admire the work you put into it. I'm a fan of long reads and historical non-fiction, and Google's results are truly garbage.

I have a criticism that I think may pertain to the ranking methodology. I searched for "discovery of Australia". Among the top results were:

* A site claiming that the biblical flood was caused by Earth colliding with a comet (with several other pages from that site also making the top search results with other wild claims, e.g. that the Egyptians discovered Arizona);

* Another site claiming the first inhabitants of Australia were a lost tribe of Israel;

* A third site claiming that Australia was discovered and founded by members of a secret society of Rosicrucians who had infiltrated the Dutch East India Company and planned to build an Australian utopia...

These were all pages heavy with HTML4 tags and virtually devoid of Javascript, the kinds of pages you'd frequently see in the late 1990s from people who had built their own static websites in a text editor, or exported HTML from MS Word. At that time, there were millions of those sites with people paying for their own unique domain names, and so the proportion of them that were home to wild-eyed conspiracy theories was relatively small.

What I think has happened is that kooks continued to keep these sites up - to the point where it's almost a visual trope now to see a red <h1> tag in Times New Roman and think, uh oh, I've stumbled on an "ancient aliens" site. Whereas scholars and journals offering higher quality information have moved to more modern platforms that rely more heavily on modern browsers - with or without their own domain names.

So as a result what seemed to surface here were the fragments of the old web that remain live - possibly because people living in cabins in Montana forget to cancel their web hosting, or because the nature of old-school conspiracy theorists is to just keep packing their old sites with walls of text surrounded by <p> tags.

Arguably, this seems to rank the way Google's engine used to, since it couldn't run JS and they wanted to punish sites that used code to change markup at render time. At least, when I used to have to do onsite SEO work, it was always about simple tag hierarchies.

I wonder whether there isn't some better metric of validity and information quality than what markup is used. Some of the sites that surfaced further down could be considered interesting and valuable resources. I think not punishing simple wall-of-text content is a good thing. But to punish more complicated layouts may have the perverse effect of downranking higher-quality sources of information - i.e. people and organizations who can afford to build a decent website, or who care to migrate to a modern blogging platform.


those three pages sound pretty interesting, I don't see this as a problem


I don’t want my search engine to somehow try to judge the believability of the results. I’d like to be the judge of that myself.


Great work. Working on an alternative search engine too. Take a look at my profile.


fantastic project, thank you!

