A search engine that favors text-heavy sites and punishes modern web design (marginalia.nu)
3441 points by Funes- on Sept 16, 2021 | 717 comments



Wow, that's awesome. Great work!

For a simple test, I searched "fall of the roman empire". In your search engine, I got wikipedia, followed by academic talks, chapters of books, and long-form blogs. All extremely useful resources.

When I search on google, I get wikipedia, followed by a listicle "8 Reasons Why Rome Fell", then the imdb page for a movie by the same name, and then two Amazon book links, which are totally useless.


Good comparison. Reminds me of an analogy I like to make of today's web, which is it feels like browsing through a magazine store — full of top 10s, shallow wow-factoids, and baity material. I genuinely believe terrible results like this are making society dumber.


The context matters. I'd happily read "Top 10" lists on a website if the site itself was dedicated to that one thing. "Top 10 Prog Rock albums", while a lazy, SEO-bait title, would at least be credible if it were on a music-oriented website.

But no, these stories all come from cookie-cutter "new media" blog sites, written by an anonymous content writer who's repackaged Wikipedia/Discogs info into Buzzfeed-style copy writing designed to get people to "share to Twitter/FB". No passion, no expertise. Just eyeballs at any cost.


This got me thinking that maybe one of the other big reasons for this is that the algorithms prioritize newer pages over older pages. This produces the problem where instead of covering a topic and refining it over time, the incentive is to repackage it over and over again.

It reminds me of an annoyance I have with the Kindle store. If I wanted to find a book on, let's say, Psychology, there is no option to find all-time respected books of the past centenary. Amazon's algorithms constantly push to recommend the latest hot book of the year. But I don't want that. A year is not enough time to have society determine if the material withstands time. I want something that has stood the test of time and is recommended by reputable institutions.


This is just a guess, but I believe that they use machine learning and rank it by the clicks. I took some Coursera courses and Andrew Ng sort of suggested that as their strategy.

The problem is that clickbait and low effort articles could be good enough to get the click, but low effort enough to drag society into the gutter. As time passes, the system is gamified more and more where the least effort for the most clicks is optimized.


It sounds like the problem is, the search engine has no way to measure satisfaction after the click.


But they have, or could have. At least Google (and to a smaller extent Microsoft), if you are using Chrome/Bing, has exactly that signal. If you stay on the site and scroll (taking time, reading, not skimming), all this could be a signal to evaluate if the search result met your needs.


I've heard Google would guess with bounce rate. Or, put another way: if the user clicks on linked website A and after a few moments keeps trying other links or related searches, it would mean the page was not valuable.


They tried to gain insight into this with the "bounce rate".


> is that the algorithms prioritize newer pages over older pages.

They do? That would explain a lot - but ironically, I can't find a good source on this. Do you have one at hand?


It is pretty obvious if you search for any old topic that is also covered incessantly by the news. "royal family" is a good example. There's no way those news stories published an hour ago are listed first due to a high PageRank score (which necessarily depends on time to accumulate inbound links).
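To illustrate why a link-based score depends on time, here is a minimal sketch of the original PageRank idea using power iteration. The graph, the page names, and the damping value are assumptions for illustration; this is not how any current search engine ranks pages.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Basic PageRank by power iteration.

    `links` maps each page to the list of pages it links to; every linked
    page must also appear as a key. Pages without outbound links spread
    their rank evenly over all pages.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            targets = outlinks if outlinks else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: an hour-old story has no inbound links yet.
links = {
    "old_reference": ["old_essay"],
    "old_essay": ["old_reference"],
    "new_story": ["old_reference"],
}
ranks = pagerank(links)
print(min(ranks, key=ranks.get))  # new_story
```

A page with no inbound links never rises above the baseline, so a fresh news story ranking first has to be getting its position from some other signal.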


It depends on the content. The flip side is looking up a programming-related question and getting results from 2012.

I think they take different things into account based on the thing being searched.


Even your example would depend upon the context. There are many cases where a programming question in 2021 is identical to one from 2012, along with the answer. In those instances, would you rather a shallow answer from 2021 or an in-depth answer from 2012? This is not meant to imply that older answers offer greater depth, yet a heavy bias towards recent material can produce that outcome in some circumstances.


If you're using tools/languages that change rapidly (like Kotlin, in my case), syntax from a few years ago will often be outdated.


Yes, yet there are programming questions that go beyond "how do I do X in language Y" or "how do I do X with library Y". The language and library specific questions are the ones where I would be less inclined to want additional depth anyhow, well, provided they aren't dependent upon some language or library specific implementation detail.


There are of course a variety of factors, including the popularity of the site the page is published on. The signals related to the site are often as important as the content on the page itself. Even different parts of the same site can lend varying weight to something published in that section.

Engagement, as measured in clicks and time spent on page, plays a big part.

But you're right, to a degree, as frequently updated pages can rank higher in many areas. A newly published page has been recently updated.

A lot depends on the (algorithmically perceived) topic too. Where news is concerned, you're completely right, algos are always going to favor newer content unless your search terms specify otherwise.

PageRank, in its original form, is long dead. Inbound link related signals are much more complex and contextual now, and other types of signals get more weight.


Your Google search results show the date on articles do they not? If people are more likely to click on "Celebrity Net Worth (2021)" than "Celebrity Net Worth (2012)", then the algo will update to favour those results, because people are clicking on them.

The only definitive source on this would be the gatekeeper itself. But Google never says anything explicitly, because they don't want people gaming search rankings. Even though it happens anyway.


The new evergreen is refreshed sludge for bottom dollar. College kids stealing Reddit comments or moving around paragraphs from old articles. Or linking to linked blogs that link elsewhere.

It's all stamped with Google Ads, of course, and then Google ranks these pages high enough to rake in eyeballs and ad dollars.

Also there's the fact that each year, the average webpage picks up two more video elements / ad players, one or two more ad overlays, a cookie banner, and half a dozen banner/interstitials. It's 3-5% content spread thinly over an ad engine.

The Google web is about squeezing ads down your throat.


Really makes you wonder: you play whack a mole and tackle the symptoms with initiatives like this search engine. But the root of that problem and many many others is the same: advertising. Why don't we try to tackle that?


Perhaps a subscription-based search engine would avoid these incentives.


Let’s go a few levels deeper and question our consumption culture


Exactly.

The only reason people make content they aren't passionate about is advertising.


> This got me thinking that maybe one of the other big reasons for this is that the algorithms prioritize newer pages over older pages.

Actually that's not always the case. We publish a lot of blog content and it's really hard to publish new content that replaces old articles. We still see articles from 2017 coming up as more popular than newer, better treatments of the same subject. If somebody knows the SEO magic to get around this I'm all ears.


Amazon search clearly does not prioritize exact title matches.


It's the "healthy web" Mozilla^1 and Google keep telling their blog audiences about. :)

1 Accept quid pro quo to send all queries to Google by default

If what these companies were telling their readers was true, i.e., that advertising is "essential" for the web to survive, then how are the sites returned by this search engine for text-heavy websites (that are not discoverable through Google, the default search engine for Chrome, Firefox, etc.) able to remain online? Advertising is essential for the "tech" company middleman business to survive.


I'm not sure I agree with your example. It seems to me it is the exact same as a "Top ten drinks to drink on a rainy day" list. There's simply too many good albums and opinions differ, so a top ten would -just like the drinks- end up being a list of the most popular ones with maybe one the author picks to stir some controversy or discussion. In my opinion the world would be a smarter place if Google ranked all such sites low. Then we might at least get fluff like "Top ten prog rock albums if you love X, hate Y and listen to Z when no one is around" instead.


Google won't rank them low because they actually do serve an important purpose. They're there for people who don't really know what they want specifically, they're looking for an overview. A top 10 gives a digestible overview on some topic, which helps the searcher narrow down what they really want.

A "Top 10 albums of all time" post is actually better off going through 10 genres of popular music from the past 50 years and picking the top album (plus mentioning some other top albums in the genre) for each one.

That gives the user the overview they're probably looking for, whether those are the top 10 albums of all time or not. It's a case of what the user searched for vs what they actually really want.


"The best minds of my generation are thinking about how to make people click ads"


So did Tim Berners-Lee. He was vehemently opposed to people shoehorning images into the WWW, because he didn't want it to turn into the equivalent of magazines. I believe he shared the opinion that they make society dumber.

Appropriately enough, I couldn't find a good quote to verify that since Google is only giving me newspapers and magazines talking about Sir Tim in the context of current events. I do believe it's in his book "Weaving the Web" though.


It's also possible that it's the other way around: a certain "common denominator" + algorithms that chase broad engagement = mediocre results.

The real trick would be some kind of engine that can aim just above where the user's at.


> I genuinely believe terrible results like this are making society dumber.

You have the causality reversed. Google results reflect the fact that society is dumb.


Google results reflect the fact that educating and informing people has low profit margins.


Or the distribution of people now online better reflects stupidity in the general population.


What I really want is a true AI to search through all that and figure out the useful truth. I don't know how to do this (and of course whoever writes the AI needs to be unbiased...)


>whoever writes the AI needs to be unbiased...)

I'm not sure the idea of a sentient being not having a bias is meaningful. Reality, once you get past the trivial bits, is subjective.


Isn't there a fundamental ML postulate that learning without bias is impossible?

Maybe not the same kind of bias we think of in terms of politics and such, but I wonder if there's a connection.


I didn't say the AI should be unbiased, just whoever writes it.

I want an AI that is biased to the truth when there is an objective one, and my tastes otherwise. (that is when asked to find a good book it should give me fantasy even though romance is the most popular genre and so will have better reviews)


I think that is the goal, it's just what we currently have is an AI that's like a naive child who is easily tricked and distracted by clickbait.


>an AI that's <snip> easily tricked and distracted by clickbait.

So, AIs are actually on par with most adults now? (Sorry)


Cool, it appears that the trend towards JS may be causing self-selection -- if a page has a high amount of JS, it is highly unlikely to contain anything of value.


True. Unfortunately many large corporate websites through which you pay bills, order tickets, etc. are becoming infested with JS widgets and bulky, slow interfaces. These are hard to avoid.


Conversely, no software to install. Browser as a platform. Don’t have to boot to Windows to pay your bills with ActiveX, for example.


The mostly JS-less web was fine, fast, and reliable 20 years ago and I never had ActiveX.

I hear stories about Flash and ActiveX but I literally never needed these to shop or pay bills online. Payments also didn't require scripts from a dozen domains and four redirects.


Yup, and taking payments online was awful but privacy was more of a thing. In South Korea ActiveX was required until recently. https://www.theregister.com/2020/12/10/south_korea_activex_c...


The platform isn't the problem. The problem is with the amount of code that does something other than letting you "pay bills, order tickets, etc.".


Huh. A weighted algorithm, somewhere between Google and the one linked, where you could subtract from sites by amount of JavaScript might be interesting.
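A rough sketch of what subtracting by amount of JavaScript might look like, using only Python's standard `html.parser`. The names `ScriptTextCounter` and `js_penalized_score`, the `weight` parameter, and the idea of using the inline-script share as the penalty are all assumptions for illustration, not anything the linked engine or Google is known to do.

```python
from html.parser import HTMLParser


class ScriptTextCounter(HTMLParser):
    """Count characters of inline script versus other text in a page."""

    def __init__(self):
        super().__init__()
        self.current = None  # "script" or "style" while inside one
        self.script_chars = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.current = tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        if self.current == "script":
            self.script_chars += len(data)
        elif self.current is None:
            # Stylesheet contents are ignored; everything else is text.
            self.text_chars += len(data.strip())


def js_penalized_score(base_score, html, weight=0.5):
    """Scale a relevance score down by the share of the page that is script.

    `base_score` and `weight` are hypothetical. External scripts are not
    measured, since that would require fetching them.
    """
    counter = ScriptTextCounter()
    counter.feed(html)
    counter.close()
    total = counter.script_chars + counter.text_chars
    script_share = counter.script_chars / total if total else 0.0
    return base_score * (1.0 - weight * script_share)


page = "<p>hello</p><script>var x = 1;</script>"
print(js_penalized_score(1.0, page))
```

With `weight=0` this reduces to the unpenalized score, so the same knob could slide between the two behaviours.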


If one could create a metric of ad to content ratio from the JS used, I would guess that would be a nice differentiator too.


Browsers should be cherry picking the most compelling things that people accomplish with complex code and supporting them as a native feature. Maybe the Browser Wars aren’t keeping up anymore.


Was that ever in doubt?


However, when searching for "haskell type inference algorithm" I get completely useless results.


That query is too long apparently. But if you shorten to "haskell type inference", I think it delivers on its promise:

> If you are looking for fact, this is almost certainly the wrong tool. If you are looking for serendipity, you're on the right track. When was the last time you just stumbled onto something interesting, by the way?


The search engine doesn't do any type of re-ordering or synonym stuff, it only tries to construct different N-grams from the search query.

So, for example, compare "SDL tutorial" with "SDL tutorials". On Google you'd get the same stuff; this search engine, for better or worse, doesn't.

This is a design decision, for now anyway, mostly because I'm incredibly annoyed when algorithms are second-guessing me. On the other hand, it does mean you sometimes have to try different searches to get relevant results.
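A minimal sketch of what building word N-grams from a query could look like when there is no stemming or synonym expansion. The function name `query_ngrams`, the whitespace tokenization, and the `max_n` limit are assumptions for illustration, not the engine's actual code.

```python
def query_ngrams(query, max_n=3):
    """Return all contiguous word n-grams (n = 1..max_n) of a query.

    Hypothetical helper: tokens are lowercased and split on whitespace,
    with no stemming, so "tutorial" and "tutorials" stay distinct.
    """
    tokens = query.lower().split()
    ngrams = []
    for n in range(1, min(max_n, len(tokens)) + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


print(query_ngrams("SDL tutorial"))   # ['sdl', 'tutorial', 'sdl tutorial']
print(query_ngrams("SDL tutorials"))  # ['sdl', 'tutorials', 'sdl tutorials']
```

The two queries share only the term `sdl`, which is why they can return different results.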


I like this design decision. It pays you back for choosing your search terms carefully.


I’m not against a stemmer, actually, just against the aggressive concordances (?) that Google now employs, like when it shows me X in Banach spaces (the classical, textbook case) when I’m specifically searching for X in Fréchet spaces (the generalization I want to find but am not sure exists); of course Banach spaces and Fréchet spaces are almost exclusively encountered in the same context, but it doesn’t mean that one is a popular typo for the other! (The relative rarity of both of these in the corpus probably doesn’t help. The farcical case is BRST, or Becchi-Rouet-Stora-Tyutin, in physics, as it is literally a single key away from “best” and thus almost impossible to search for.)

On the other hand, Google’s unawareness of (extensive and ubiquitous) Russian noun morphology is essentially what allowed Yandex to exist: both 2011 Yandex and 2021 Google are much more helpful for Russian than 2011 Google. I suspect (but have not checked) that the engine under discussion is utterly unusable for it. English (along with other Germanic and Romance languages to a lesser extent) is quite unusual in being meaningfully searchable without any understanding of morphology, globally speaking.


I thought you could fix that by enclosing "BRST" in quotes, but apparently not. DuckDuckGo (which uses Google) returns a couple of results that do contain "BRST" in a medical context, but most results don't contain this string at all. What's going on?


I’m not certain what DDG actually uses (wasn’t it Bing?), but in my experience from the last couple of months it ignores quotes substantially more eagerly than Google does. For this particular term, a little bit of domain knowledge helps: even without quotes, brst becchi, brst formalism, brst quantization or perhaps bv brst will get you reasonable results. (I could swear Google corrected brst quantization to best quantization a year ago, but apparently not anymore.) Searching for stuff in the context of BRST is still somewhat unpleasant, though.

I... don’t think anything particularly surprising is happening here, except for quotes being apparently ignored? I’ve had it explained to me that a rare word is essentially indistinguishable from a popular misspelling by NLP techniques as they currently exist, except by feeding the machine a massive dictionary (and perhaps not even then). BRST is a thing that you essentially can’t even define satisfactorily without at the very least four years of university-level physics (going by the conventional broad approach—the most direct possible road can of course be shorter if not necessarily more illuminating). “Best” is a very popular word both generally and in searches, and the R key is next to E on a Latin keyboard. If you are a perfect probabilistic reasoner with only these facts for context (and especially if you ignore case), I can very well believe that your best possible course of action is to assume a typo.
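The "perfect probabilistic reasoner" argument can be sketched with Bayes' rule. Every number below is made up for illustration, and `typo_posterior` is a hypothetical helper, not a description of how any real spelling corrector works.

```python
def typo_posterior(prior_intended, prior_literal, p_typo):
    """Posterior probability that the user meant the popular word.

    prior_intended: hypothetical prior that a query is for the popular word
    prior_literal:  hypothetical prior that a query is for the rare word
    p_typo:         hypothetical chance of typing the rare string by mistake
                    when the popular word was intended (adjacent-key slip)
    Assumes the rare word, when intended, is always typed correctly.
    """
    typo_mass = prior_intended * p_typo
    literal_mass = prior_literal * 1.0
    return typo_mass / (typo_mass + literal_mass)


# Made-up numbers: "best" is a million times more common than "brst",
# and one in ten thousand attempts at "best" comes out as "brst".
print(typo_posterior(prior_intended=1e-2, prior_literal=1e-8, p_typo=1e-4))
```

With these invented inputs the posterior comes out at roughly 0.99 in favour of the typo, which is the point: without extra context, assuming a typo is the statistically safer bet.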

How to permit overriding that decision (and indeed how to recognize you’ve actually made one worth worrying about without massive human input—e.g. Russian adjectives can have more than 20 distinct forms, can be made up on the spot by following productive word-formation processes, and you don’t want to learn all of the world’s languages!) is simply a very difficult problem for what is probably a marginal benefit in the grand scheme of things.

I just dislike hitting these margins so much.


It would not be a difficult problem if they allowed the " " operator to work as they claim it does, or revive the + operator.


In English, maybe; in Russian, I frequently find myself reaching for the nonexistent “morphology but not synonyms” operator (as the same noun phrase can take a different form depending on whether it is the subject or the object of a verb, or even on which verb it is the object of); even German should have the same problem AFAIU, if a bit milder. I don’t dare think about how speakers of agglutinative languages (Finnish, Turkish, Malayalam) suffer.

(DDG docs do say it supports +... and even +"...", but I can’t seem to get them to do what I want.)


Ah, OK. I don’t know anything about Russian. This is a hard problem. I think the solution is something like what you suggest: more operators allowing different transformations. Even in English, I would like a "you may pluralize but nothing else" operator.


Well it’s not that alien, it (along with the other Eastern Slavic languages, Ukrainian and Belarusian) is mostly a run-of-the-mill European language (unlike Finnish, Estonian or Hungarian) except it didn’t lose the Indo-European noun case system like most but instead developed even more cases. That is, where English or French would differentiate the roles of different arguments of a verb by prepositions or implicitly by position, Russian (like German and Latin) has a special axis of noun forms called “case” which it uses for that (and also prepositions, which now require a certain case as well—a noun form can’t not have a case like it can’t not have a number).

There are six of them (nominative [subject], genitive [belonging, part, absence, “of”], dative [indirect object, recipient, “to”], accusative [direct object], instrumental [device, means, “by”], prepositional [what the hell even is this]), so you have (cases) × (numbers) = 6 × 2 = 12 noun forms, and adjectives agree in number and gender with their noun, but (unlike Romance languages) plurals don’t have gender, so you have (cases) × (numbers and genders) = 6 × (3 + 1) = 24 adjective forms.
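The form counts above can be checked by simply enumerating the combinations. This sketch counts grammatical slots only; the label strings are my own, and in practice several slots share the same surface form.

```python
from itertools import product

CASES = ["nominative", "genitive", "dative", "accusative",
         "instrumental", "prepositional"]
NUMBERS = ["singular", "plural"]
# Adjectives: three genders in the singular, one genderless plural.
ADJ_NUMBER_GENDER = ["masculine sg", "feminine sg", "neuter sg", "plural"]

noun_forms = list(product(CASES, NUMBERS))
adjective_forms = list(product(CASES, ADJ_NUMBER_GENDER))

print(len(noun_forms))       # 12
print(len(adjective_forms))  # 24
```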

None of this would be particularly problematic, except these forms work like French or Spanish verbs: they are synthetic (case, number and gender are all a single fused ending, not orthogonal ones) and highly convoluted with a lot of irregularities. And nouns and adjectives are usually more important for a web search than verbs.


> BRST, or Becchi-Rouet-Stora-Tyutin is literally a single key away from “best” and thus almost impossible to search for.

Hmm I seem to be getting only relevant results, no "best", not sure what you mean. Are you not doing verbatim search?

https://www.google.com/search?q=brst&tbs=li:1


English is more the outlier in regard to Germanic languages, try German or Finnish, with their wonderful compounds :)

https://e.humanities.uva.nl/publications/2004/kamp_lang04.pd...


Well yeah, English is kind of weird, but Finnish isn’t a Germanic language at all? It’s not even Indo-European, so even Hindi is ostensibly closer to English than Finnish. I understand Standard German (along with Icelandic) is itself a bit atypical in that it hasn’t lost its cases when most other Germanic languages did.

Re compounds, I expected they would be more or less easy to deal with by relatively dumb splitting, similar to greedy solutions to the “no spaces” problem of Chinese and Japanese, and your link seems to bear that out. But yeah, cheers to more language-specific stuff in your indexing. /s
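For what "relatively dumb splitting" might mean, here is a greedy longest-prefix sketch. The function `split_compound`, the tiny lexicon, and the `min_len` cutoff are assumptions for illustration; it does no backtracking and ignores linking elements, so it is not the method from the linked paper.

```python
def split_compound(word, lexicon, min_len=3):
    """Greedily split `word` into the longest known parts, left to right.

    Returns a list of parts, or [word] unchanged if no full split exists.
    Hypothetical sketch: no backtracking and no linking elements
    (such as the German "s" between parts).
    """
    parts = []
    rest = word.lower()
    while rest:
        # Try the longest candidate prefix first, down to min_len characters.
        for end in range(len(rest), min_len - 1, -1):
            if rest[:end] in lexicon:
                parts.append(rest[:end])
                rest = rest[end:]
                break
        else:
            return [word]
    return parts


lexicon = {"haus", "tür", "schlüssel"}
print(split_compound("Haustürschlüssel", lexicon))  # ['haus', 'tür', 'schlüssel']
```

An indexer could then store both the full compound and its parts, so a search for one part still finds the page.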


Gaaah, brain fart - you're right, of course, dunno why I included it.


Maybe list the synonyms under the query, so it's easier to try different formulations.


Oh this sounds like it could be a really cool idea! This way it could also be subtly teaching users that the engine doesn't do automatic synonyms translation so it's worth experimenting; also kinda like giving the synonyms feature while still keeping user in full control.


Don't change it. It's good this way.


It could simply become an option.


Since it does not use synonyms, it looks like it is unable to answer "how's that thing called"-queries.


It would be nice if we could pipe search engines.


Definitely; We could create a meta search engine that queries them all, in desktop application format.

Let's name it after a famous old scientist, and maybe add the year to prove it's modern: Galileo 2021.


Meta search engines leave a bad taste in everyone's mouth because they've always failed. Here is why:

https://en.wikipedia.org/wiki/Arrow%27s_impossibility_theore...

You can't combine a few different ranked lists and expect to get results better than any of the original ranked lists.


> You can't combine a few different ranked lists and expect to get results better than any of the original ranked lists.

I am skeptical of this application of the theorem. Here is my proposal:

Take the top 10 Google and Bing results. If the top result from Bing is in the top 10 from Google, display Google results. If the top result from Bing is not in the top 10 from Google, place it at the 10th position. You'd have an algorithm that ties with Google, say 98% of the time, beats it say, 1.2% of the time, and loses .8% of the time.
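The proposal is simple enough to write down directly. A minimal sketch, assuming the two result lists have already been fetched as plain lists of URLs; `merge_results` is a hypothetical name.

```python
def merge_results(google_top10, bing_top10):
    """Merge two ranked lists following the rule proposed above.

    If Bing's top result already appears in Google's top 10, return Google's
    list unchanged; otherwise put Bing's top result in the 10th position.
    The inputs are plain lists of URLs; fetching them is out of scope.
    """
    if not bing_top10 or bing_top10[0] in google_top10:
        return list(google_top10)
    return list(google_top10[:9]) + [bing_top10[0]]


google = [f"g{i}" for i in range(10)]
print(merge_results(google, ["b0", "g1", "g2"]))
```

Whether the merged list actually beats Google's in the claimed share of cases is an empirical question this code can't answer; it only shows the rule is well defined.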


Right. Arrow's theorem just says it's impossible to do it in all cases. It's still quite possible to get an improvement in a large proportion of cases, as you're proposing.


I've had jobs tuning up the relevance of search engines with methods like

https://ccc.inaoep.mx/~villasen/bib/AN%20OVERVIEW%20OF%20EVA...

and the first conclusion is "something that you think will improve relevance probably won't"; the TREC conference went for about five years before making the first real discovery

https://en.wikipedia.org/wiki/Okapi_BM25

It's true that Arrow's Theorem doesn't strictly apply, but thinking about it makes it clear that the aggregation problem is ill-defined and tricky. (e.g. note also that a ranking function for full text search might have a range of 0-1 but is not a meaningful number, like a probability estimate that a document is relevant, but it just means that a result with a higher score is likely to be more relevant than one with a lower score.)

Another way to think about it is that for any given feature architecture (say "bag of words") there is an (unknown) ideal ranking function.

You might think that a real ranking function is the ideal ranking function plus an error and that averaging several ranking functions would keep the contribution of the ideal ranking function and the errors would average out, but actually the errors are correlated.

In the case of BM25 for instance, it turns out you have to carefully tune between the biases of "long documents get more hits because they have more words in them" and "short documents rank higher because the document vectors are spiky like the query vectors". Until BM25 there wasn't a function that could be tuned up properly and just averaging several bad functions doesn't solve the real problem.
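For readers who haven't seen it, here is a compact sketch of Okapi BM25 showing where the length trade-off lives. It uses a commonly used smoothed IDF variant; the default `k1` and `b` are typical starting values rather than tuned ones, and the toy documents are invented.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25.

    `documents` is a non-empty list of token lists. `b` controls length
    normalization (0 = ignore document length, 1 = fully normalize) and
    `k1` controls how quickly repeated terms saturate.
    """
    n_docs = len(documents)
    avg_len = sum(len(d) for d in documents) / n_docs
    doc_freq = Counter(term for d in documents for term in set(d))
    scores = []
    for doc in documents:
        tf = Counter(doc)
        norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF variant that stays non-negative.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


docs = [["rome", "fell"], ["rome"] + ["filler"] * 19, ["athens"]]
print(bm25_scores(["rome"], docs))
print(bm25_scores(["rome"], docs, b=0.0))
```

With `b=0.75` the short document outscores the padded one for the same single match; with `b=0` they tie. Tuning `b` is the balancing act between the two biases described above.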


Arrow's theorem simply doesn't apply here. We don't need our personalized search results to satisfy the majority.


But in both cases you face the problem of aggregating preferences of many into one. In one case you are combining personal preferences in the other case aggregating ‘preferences’ expressed by search engines.


But search engines aren't voting to maximize the chances that their preferred candidate shows up on top. The mixed ranker has no requirement to satisfy Arrow's integrity constraints. It has to satisfy the end user, which is quite possible in theory.

Conditions the mixed ranker doesn't have to satisfy: "ranking while also meeting a specified set of criteria: unrestricted domain, non-dictatorship, Pareto efficiency, and independence of irrelevant alternatives"


Sure, but the problem that conventional IR ranking functions are not meaningful other than by ordering leads you to the dismal world of political economy where you can't aggregate people's utility functions. (Thus you can't say anything about inequality, only about Pareto efficiency)

Hypothetically you could treat these functions as meaningful but when you try you find that they aren't very meaningful.

For instance IBM Watson aggregated multiple search sources by converting all the relevance scores to "the probability that this result is relevant".

A conventional search engine will do horribly in that respect, you can fit a logit curve to make a probability estimator and you might get p=0.7 at the most and very rarely get that, in fact, you rarely get p>0.5.

If you are combining search results from search engines that use similar approaches you know those p's are not independent so you can't take a large numbers of p=0.7's and turn that into a higher p.
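To see what the independence assumption would buy you if it held, here is one naive way to combine probabilities (noisy-OR). The function name is hypothetical, and the whole point of the comment above is that this calculation is not valid for engines whose errors are correlated.

```python
def combine_if_independent(probabilities):
    """Probability that at least one signal is right, assuming independence.

    This is the naive noisy-OR combination. It is only valid when the
    estimates are independent, which the comment above argues is not the
    case for search engines that use similar approaches.
    """
    p_all_wrong = 1.0
    for p in probabilities:
        p_all_wrong *= 1.0 - p
    return 1.0 - p_all_wrong


print(combine_if_independent([0.7, 0.7, 0.7]))  # about 0.973 under independence
```

If the three engines are really the same signal measured three times, the honest combined estimate stays near 0.7, not 0.973.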

If you are using search engines that use radically different matching strategies (say they return only p=0.99 results with low recall) the Watson approach works, but you need a big team to develop a long tail of matching strategies.

If you had a good p-estimator for search you could do all sorts of things that normal search engines do poorly, such as "get an email when a p>0.5 document is added to the collection."

For now alerting features are either absent or useless and most people have no idea why.


That's an invalid application of this theorem. (It doesn't necessarily hold)

Suppose there's an unambiguous ranked preference by all people among a set (webpages, ranking). Suppose one search engine ranks correctly the top 5 results and incorrectly the next 5 results, while another ranks incorrectly the top 5 and correctly the next 5.

What can happen is that there may be no universally preferred search engine (likely). In practice, as another commenter noted, most users may also prefer a certain combination of results (that's not difficult to imagine, for example by combining top independent results from different engines).


... is this Galileo 2021 a reference that I am not understanding?


Yup, but so far no one got it.

There was such an app in the early 2000's, before Google went mainstream, and Altavista-like engines were not good: Copernic 2000.

I guess I'm officially old now.


For years I wanted to try Copernic Summarizer. It seemed like it actually worked. Then software that did summaries disappeared, maybe? And about 5 years ago bots on Reddit were doing summaries of news stories (and then links in comments).

This is a pattern I see over and over again: some research group or academics show that something can be done (summaries that make sense and are true summaries, evolutionary algorithm FPGA programming, real time gaze prediction, etc.) and there's a few published code repos and a bit of news, then 'poof', nowhere to be seen for 15 years or more.


FWIW, I got the reference. Maybe I'm old too?


I was always a dogpile user :p


Hotbot!


Brave browser currently has "Google fallback" which sometimes mixes in Google search results with Brave's own search engine.

https://search.brave.com/help/google-fallback


Not an app, but probably comes quite close in all other respects: https://metager.org


I need that with a simpler interface, so I call it after a famous detective: Sherlock.


*a magic pop sound is faintly audible as a new side project is appended to several lists* Excellent, thank you!


Likely trademark collision with this: https://www.galileo.usg.edu/


As long as few people use it, it will be great. Rest assured that the moment it becomes popular, the people who want to game it will appear.


This sort of optimization is why simple recipes are typically found at the end of a rambling pointless blog post now.

Still, the best way to break SEO is to have actual competition in the search space. As long as SEO remains focused on Google there is an opportunity for these companies to thrive by evading SEO braindamage.


That sort of recipe blog hasn't happened just for SEO. It's also a bit of a "two audiences" problem: if you are coming to that food blogger from a search you certainly would prefer the recipe first and then maybe any commentary on it below if the recipe looks good. If you are a regular reader of that food blogger you are probably invested in the stories up top and that parasocial connection and the recipes themselves are sometimes incidental to why you are a regular reader.

You see some of that "two readers" divide sometimes even in classic cookbooks, where "celebrity" chefs of the day might spend much of a cookbook on a long rambling memoir. Admittedly such books were generally well indexed and had table of contents to jump right to the recipes or particular recipes, but the concept of "long personal ramble of what these recipes mean to me" is an old one in cookbooks too.


I see your point, but argue you've misidentified the two audiences.

One audience matches your description and is the invested reader. They want that blogger's storytelling. They might make the recipe, but they're a dedicated reader.

The other audience is not the recipe-searcher, but instead Google. Food bloggers know that recipe-searchers are there to drop in, get an ingredient list, and move on. They won't even remember the blog's name. So the site isn't optimized for them. It's optimized for Google.

"Slow the parasitic recipe-searcher down. They're leeches, here for a freebie. Well they'll pay me in Google Rank time blocks."


> Food bloggers know that recipe-searchers are there to drop in, get an ingredient list, and move on.

This is not entirely true, though. If a randomly found recipe turns out particularly good, I'll bookmark the site and try out other dishes. It's a very practical method to find particularly good* recipe collections.

*) In this case "good" means what you need - not just subjectively "tasty", but e.g. low cost, quick to prepare, low calorie or in line with a particular diet and so on.


I'm aware of zero human-behavior-truisms that are "entirely true".


> If you are a regular reader of that food blogger

I think this assumes facts not in evidence. It certainly seems like an overwhelming number of "blogs" are not actual blogs but SEO content farms. There are no regular readers of such things because there are no actual authors, just someone who took a job on Fiverr to spew out some SEO garbage. Old content gets reposted almost verbatim because new results rank better according to Google.

The only reason these "blogs" exist is to show ads and hopefully get someone's e-mail (and implied consent) for a marke....newsletter.


I know at least a few that I commonly see in top search results that I have friends that read them like personalized soap operas where most of the drama revolves around food and family and serving food to family.

It's at least half the business models of Food Network shows: aspirational kitchens and the people that live in them and also sometimes here's their recipes. (The other half being competitions, obviously.) I've got friends that could deliver entire doctoral theses on the Bon Appetit Test Kitchen (and its many YouTube shows and blogs) and the huge soap operatic drama of 2020's events where the entire brand milkshake ducked itself; falling into people's hearts as "feel good" entertainment early in 2020/the pandemic and then exploding very dramatically with revelations and betrayals that Fall.

Which isn't to say that there aren't garbage SEO farms out there in the food blogging space as well, but a lot of the big ones people commonly complain about seeing in google's results do have regular fans/audiences. (ETA: And many of the smaller blogs want to have regular fans/audiences. It's an active influencer/"content creator" space with relatively low barrier to entry that people love. Everyone's family loves food, it's a part of the human condition.)


I've basically never been taken to a recipe without a rambling preamble from Google. While food blogs may serve two audiences, a long introduction seems to be a requirement to appear in the top Google search results.


Personally, I think that has a lot more to do with the fact that Google killed the Recipe Databases. There did used to be a few startups that tried to be Recipe Aggregators with advertising based business models, that would show recipes and then link to source blogs and/or cookbooks, and in the brief period where they existed Google scraped them entirely and showed entire recipes on search results and ate their ad revenue out from under them.


Such databases would get battered by demands to remove content these days, if not already back then. No one wants a database listing their stuff for ad revenue like that, because many users wouldn't follow the links to see their adverts or be subject to their tracking.

A couple of browser add-ons specifically geared around trimming recipe pages down have been taken down due to similar complaints.


That is a really bad thing by Google. Their core business is not recipes.


Their core business is making money from other people’s content, no matter what it is.


Their core business is advertising and they have always been in a direct conflict-of-interest by competing with content sites for ad revenue buys.


That's why I use Saffron [1], it magically converts those sites into a page in my recipe book. I found it when the developer commented here on HN. Also, a lot of cooking websites have started to add a link with "jump to recipe" functionality allowing you to skip all the crap.

[1] https://www.mysaffronapp.com/


There's also https://based.cooking.


I've noticed this pattern start to pop up elsewhere. I've started to train my skimming skills, skipping a paragraph or two at a time to get past the fluff.

Like an article about some current event will undoubtedly begin with "when I was traveling ten years ago...".


It's also because that's a way of trying to copyright protect recipes, which are normally not copyright protected.

> “Mere listings of ingredients as in recipes, formulas, compounds, or prescriptions are not subject to copyright protection. However, when a recipe or formula is accompanied by substantial literary expression in the form of an explanation or directions, or when there is a combination of recipes, as in a cookbook, there may be a basis for copyright protection.”


But that copyright protection only extends to the literary expression. The recipe itself is still not covered by copyright, even if accompanied by an essay.


That's not really for SEO, which favors readily accessible information.

That's ads. When mobile users have to scroll past 10 ads, they'll click on some of them and make the blog money.


Searching for ‘chocolate’ on this search engine turned up a surprisingly large amount of chocolate based recipes.


>> This sort of optimization is why simple recipes are typically found at the end of a rambling pointless blog post now.

I continue to be curious about this kind of complaint. If all you want is a recipe list, without any of the fluff, why would you click on a link to a blog, rather than on a link to a recipe aggregator?

Foodie blogs exist specifically for the people who want a foodie discussion and not just an ingredients' list.

Is it because blogs tend to have better recipes overall? In that case, isn't there a bit of entitlement involved in asking that the author self-sacrificingly provides only the information that you want, without taking care of their own needs and wants, also?


I think the complaint is that those blogs rank higher than nuts-and-bolts recipes now. It wasn't that way a few years ago. Yes, scrolling down the results to Food Network or Martha Stewart or whatever is possible, as is going directly to those sites and using their site search, but it's noticeable and annoying.


Not my experience. For a very quick test, I searched DDG for "omelette recipe", "carbonara recipe" and "peking duck recipe" (just to spice it up a bit) and all my top results are aggregators. Even "avgolemono recipe" (which I'd think is very specialised) is aggregators on top.

To be honest, I don't follow recipes when I cook unless it's a dish I've never had before. At that point what I want is to understand the point of the dish. A list of ingredients and preparation instructions don't tell me what it's supposed to taste and smell like. The foodie blogs at least try to create a certain... feeling of place, I suppose, some kind of impression that guides you when you cook. I wouldn't say it always works but I appreciate the effort.

My real complaint with recipe writers is that they know how to cook one or two dishes well and they crib the rest off each other so even with all the information they provide, you still can't reliably cook a good meal from a recipe unless you've had the dish before. But that's my personal opinion.


Because when you search for a recipe you get the link to the blog, not the aggregator.


It's the same thing that people always complain about. This thing is not in a format that I like, so it must be not what anyone likes.

If you want JUST recipes, pay money instead of just randomly googling around. America's Test Kitchen has a billion vetted and really good recipes. That solves that problem.


Specialization is mostly a problem in monocultures.

If you almost only plant wheat, you are going to end up with one hell of a pest problem.

If you almost only have Windows XP, you are going to have one hell of a virus problem.

If you almost only have SearchRank-style search engines (or just the one), you are going to have one hell of a content spam problem.

Even though they have some pretty dodgy incentives, I don't think google suffers quality problems because they are evil, I think ultimately they suffer because they're so dominant. Whatever they do, the spammers adapt almost instantly.

A diverse ecosystem on the other hand limits the viability of specialization by its very nature. If one actor is attacked, it shrinks and that reduces the opportunity for attacking it.


I don't think the existing media-heavy websites are gaming Google to rank higher. It's that Google itself prefers media heavy content; they don't have to "game" anything.

I also think a search engine like this would be quite hard to game. An ML-based classifier trained on thousands of text-heavy and media-heavy screenshots should be quite robust and I think would be very hard to evade, so the "game" will become more about how to identify the crawler so you can serve it a high-ranking page while serving crap to the real users. That seems fairly easy to defeat if the search engine does a second pass using residential proxies and standard browser user agents to detect this behavior (it could also threaten huge penalties, like the entire domain being banned for a month, to deter even attempts at this).
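
The second-pass check could be sketched roughly like this in Python. Everything here is an assumption for illustration: `jaccard`, `looks_cloaked` and the 0.5 threshold are made-up names and values, and the two page bodies are presumed to have been fetched already (once as the crawler, once through a browser-like client).

```python
def jaccard(a: str, b: str) -> float:
    """Word-set similarity between two page texts, from 0.0 to 1.0."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty pages are trivially identical
    return len(words_a & words_b) / len(words_a | words_b)


def looks_cloaked(crawler_view: str, browser_view: str, threshold: float = 0.5) -> bool:
    """Flag a page whose crawler-facing text differs sharply from what a browser gets.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    return jaccard(crawler_view, browser_view) < threshold


honest = "a long essay about the fall of the roman empire"
bait = "buy cheap pills now click here subscribe"
print(looks_cloaked(honest, honest))  # False
print(looks_cloaked(honest, bait))    # True
```

A real check would have to tolerate legitimate differences between fetches (rotating ads, timestamps), which is why this compares word sets rather than requiring identical bytes.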


With the advances in machine-generated text that looks plausible but isn't quite accurate (e.g. GPT-3), this seems like it would be easily gamed (given access to GPT-3). Even without GPT-3, if the content being prioritized is mere text, I'm sure that for a pile of money, I could generate something that looks like Wikipedia, in the sense that it's a giant pile of mostly text, but it would make zero sense to a human reader. (Building an SEO farm to boost ranking of not-wikipedia is left as an exercise for the reader.)


If there were a wider variety of popular search engines, with different ranking criteria, would sites begin to move away from gaming the system? Surely it would be too hard to game more than one search engine at a time?


It would be a matter of numbers anyway about which they optimize for. A/B testing is already in place and doesn't care about where it comes from, just which one does better.


There should be some perfect balance where this search engine is N% as popular as Google, where Google soaks up all of the gamifiers, but this search engine is still popular enough to derive revenue and do ML and other search-engine-useful stuff.


> the people who want to game it will appear.

So just add human review to the mix, if a site is obviously trying to game the system (listicles, seo spam etc) just drop and ban them from the search index.


Congratulations, you've just invented negative SEO.


>followed by a listicle "8 Reasons Why Rome Fell"

but aren't you curious about the 7th reason? it will surprise you!


You won't believe how Claudius looks today!


Doctors HATE him!!!


Search engines whose revenue is based on advertising will ultimately be tuned to steer you to the ad foodchain. All the incentives are aligned towards, and all the metrics are ultimately in service of, profit for advertisers. Not in the 99% of people who can be convinced to consume something by ads? Welp, screw you.


Search engines should be something you pay for. Surely search engine powerusers can afford to pay for such a service. If Google makes $1 per user per month or something, that's not too high a bar to get over.


Search engines should be like libraries. At least some tiny sliver of the billions we spend on education and research should go to, you know, actually organizing the world's information and making it universally available.


I see another issue here: companies like Google prioritize information to 1) keep their users and 2) maximize their profit.

If you move data organization to another type of organization (non-profit, state, universities - private or public), then the question of data prioritization becomes highly political. What should be exposed? What should not? What to put first? ...

It is already, but to a smaller extent since money-making companies have little interest in data meaning, and high interest in the commercial value of their users.


In which case, consider paying for something like Infinity: https://infinitysearch.co/


The theoretical cap for this, if you include every human being on planet Earth at $1 per month, is $7 billion/month. This translates into $84 billion annual revenue.

Google's revenue last year was $146 billion, and it operates nowhere near the theoretical maximum. Most of that revenue is advertisement.
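
For anyone who wants to play with the numbers, the arithmetic is easy to reproduce in Python. The 7 billion population, the $1/month price and the 146 billion revenue figure are the ones quoted in this thread; the function name is made up for the example.

```python
def subscription_cap(users: int, price_per_month: float) -> float:
    """Annual revenue if every user paid the monthly price all year."""
    return users * price_per_month * 12


WORLD_POPULATION = 7_000_000_000   # figure used in the comment above
GOOGLE_REVENUE = 146_000_000_000   # figure quoted in the comment above

annual_cap = subscription_cap(WORLD_POPULATION, 1)
print(f"cap: ${annual_cap / 1e9:.0f} billion/year")                   # cap: $84 billion/year
print(f"share of quoted revenue: {annual_cap / GOOGLE_REVENUE:.1%}")  # 57.5%
```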


The Wikipedia link at the top is always given. It would maybe be good to make it a little clearer that it's not one of the true results.


I think this is just because of the terms you have searched. In my test searches Wikipedia has not come up once in first position (I think the highest was 3rd in the list).

Here's what I've tried with a few variations: golang generics proposal, machine learning transformer, covid hospitalization germany

[edit] formatting


I think maybe it's a special insert at the top, but only if a Wikipedia page is found that matches your search term? I'm not sure now though.


Imagine if you were looking for the movie.


Then you'd use a different search engine. Why does everything have to be a Swiss Army knife?


Or you could just search for 'rome movie'. Though for more complex disambiguation you would need to resort to, e.g. schema.org descriptions (which are supported by most search engines, and the foundation for most "smart" search result snippets).


That's a fair point. This engine would be useful if you need grep over the internet (but without regexes), i.e. when you want to find exact phrases. But that's a relatively narrow use case.


I tend to prefer Wikipedia for movies. The exception is actor headshots if I'm trying to identify someone, which Wikipedia lacks for licensing reasons, but otherwise Wikipedia tends to be better than IMDB for most needs. Wikipedia has an IMDB link on every article anyway.

Another need I guess might be reviews, for which RT or MC are better than IMDB: not sure if either of those two will fare better than IMDB in this search engine but again Wiki has links out (in addition to good reception summaries)


For me, imdb was much better when they had user comments/discussion.

I never even posted on it myself, but browsing the discussions one could learn all sorts of trivia, inside info, speculation, etc about each movie.

Since they (inexplicably) killed that feature, I rarely even visit anymore. You're right, for many purposes Wikipedia is better, especially for TV series episode lists with summaries.


IMDB management thought it was their brilliant editorial work that drew people to their site. Morons. It was the comments all along. Of course they also believed they could create gravity-free zones by sheer force of executive will (and maybe still do).


Especially for old and lesser known movies, the discussion board for the movie was a brilliant addition that could give the movie an extra dimension. Context is very important in order to understand, and ultimately enjoy something.

I think they removed it in part because new movies, like Star Wars and superhero movies, had a lot of negative activity.


I find IMDb to be more convenient than RT/MC/Wikipedia for finding release dates of movies - nearly every other website lists only the American release date, maybe one or two others if the movie was disproportionately popular in certain regions.


Imagine including the search term "movie".


That doesn't do anything useful.


?q=imdb.com:fall of the roman empire


!imdb


Wow I used "personality test" and actually got useful articles about personality theory. I'll actually use this!


I think it's a case where systems diversity can be an advantage. Much like how most malware was historically written for Windows and could be avoided by using Linux, the low-quality search engine bait is created for Google and can be avoided by using a different style of search engine.


Interesting choice of search topic. Are you trying to make an additional point?


No one mentioned the "bonus" audio in the page source: https://www.youtube.com/watch?v=7fCifJR6LAY


;-)


I tried some queries for Harry Potter fanfictions, and the results were pretty much completely unrelated. There weren’t that many results, either.


I'm curious what you searched for.

https://search.marginalia.nu/search?query=harry+potter+fanfi...

This seems to return a pretty decent number of sites relating to that (as well as some sites not relating to that).

The search engine isn't always great at knowing what a page is about, unfortunately.

This seemed to return mostly relevant results

https://search.marginalia.nu/search?query=%22harry+potter%22...


Yes, shorter queries return more relevant results. I think this was the first query that came to my mind:

https://search.marginalia.nu/search?query=Best+%22harry+pott...


Yeah, that's just not a type of query my search engine is particularly good at. It's pretty dumb, and just tries to match as much of the webpage against the query as it can.

This used to be how all search engines worked, but I guess people have been taught by google that they should ask questions now, instead of search for terms.

I wonder how I can guide people to make more suitable queries. Maybe I should just make it look less like google.


I had the exact opposite experience. I searched the site for "java", got a Wikipedia link first (for the island, not the programming language), and the 2nd result was to a random JEP page, and all the rest of the results were random tidbits about Java (e.g. "XZ compression algorithm in Java"). Didn't get any high level results pointing to an overview of the language, getting started guides, etc.


You need to use some old school search techniques and search for “Java overview”


I'm not sure that's a bad thing.


well, they're results to java related items...

What kind of links were you expecting to find?


If this search engine ever takes off, the listicle writers will just start optimizing for it too, right?


Mission accomplished, then.


If the goal was to remove modern web design, ok sure mission accomplished.

If your goal was to create a search engine that ignored listicles and other fluff and instead got you meatier results like "academic talks" and such, then no.


When a measure becomes a target, it ceases to be a good measure.

https://en.wikipedia.org/wiki/Goodhart%27s_law


I did a search for "George Washington"

First result after Wikipedia:

"Radiophone Transmitter on the U.S.S. George Washington (1920)

In 1906, Reginald Fessenden contracted with General Electric to build the first alternator transmitter. G.E. continued to perfect alternator transmitter design, and at the time of this report, the Navy was operating one of G.E.'s 200 kilowatt alternators http://earlyradiohistory.us/1919wsh.htm "

Another result in the first few:

" - VANDERBILT, GEORGE WASHINGTON

PH: (800) ###-#233 FX: (#03) 641-5###. https://www.ScottWinslow.com/manufacturer/VANDERBILT_GEORGE_... "

And just below that terrible result:

"I Looked and I Listened -- George Washington Hill extract (1954)

Although the events described in this account are undated, they appear to have occurred in late 1928. I Looked and I Listened, Ben Gross, 1954, pages 104-105: Programs such as these called for the expenditure of larger sums than NBC had anticipated. It be http://earlyradiohistory.us/1954ayl2.htm "

Dramatically worse than Google.

---

Ok, how about a search for "Rome" then? Surely it'll pull some great text results for the city or the ancient empire.

First result after Wikipedia:

"Home | Rome Daily Sentinel

Reliable Community News for Oneida, Madison and Lewis County http://romesentinel.com/"

The fourth result for searching "Rome":

"Glenn's Pens - Stores of Note

Glenn's Pens, web site about pens, inks, stores, companies - the pleasure of owning and using a pen of choice. Direcdtory of pen stores in Europe. http://www.marcuslink.com/pens/storesofnote/roma.html"

Again, dramatically worse than Google.

---

Ok, how about if I search for "British"?

First result after Wikipedia:

"BRITISH MINING DATABASE

British_Mining_Database http://www.users.globalnet.co.uk/~lizcolin/bmd.htm "

And after that:

"British Virgin Islands

Many of these photos were taken on board the Spirit of Massachusetts. The sailing trip was organized by Toto Tours. Images Copyright © Lowell Greenberg Home Up Spring Quail Gardens Forest Home Lake Hodges Cape Falcon Cape Lookout, Oregon Wahkeena http://www.earthrenewal.org/british_virgin_islands2.htm"

Again, far off the mark and dramatically worse than Google.

I like the idea of Google having lots of search competition, this isn't there yet (and I wouldn't expect it to be). I don't think overhyping its results does it any favors.


What were you expecting to see for British? There must be millions of pages containing that term. Anyway the first screenful from Google is unadulterated crap, advertising mixed with the usual trivia questions.

If you are going to claim something is wide of the mark then you really ought to tell us at least roughly where the mark is.


This is not a Google competitor, it's a different type of search engine with different goals.

> If you are looking for fact, this is almost certainly the wrong tool. If you are looking for serendipity, you're on the right track. When was the last time you just stumbled onto something interesting, by the way?


I checked the results of the same query and they seem fine. Lots of speeches and articles about George Washington the US president. There's even his beer recipe.

As for the results you linked, it's part of the zeitgeist to list other entities sharing the same name. Sure, they could use some subtle changes in ranking, but overall the returned links satisfy my curiosity.


Yeah, Google tends to send a lot of junk back.


Yeah so this is my project. It's very much a work in progress, but occasionally I think it works remarkably well for something I cobbled together alone out of consumer hardware and home-made code :-)


It's very rare that I see a project on HN I can see myself using. This is one. Like others have said, the results can be a little rough. But they're rough in a way I think is much more manageable than the idiosyncrasies of more 'clever' search engines.


I think you need to approach it more like grep than google. It's a forgotten art, dealing with this type of dumb search engine.

Like if you search for "How do I make a steak", you aren't going to get very good results. But a better query is "Steak Recipe", as that is at least a conceivable H1-tag.


This is exactly how I prefer to use my search engines.


I searched like this all my life and always got expected results.

But just a week ago I found out that these "how", "what" questions give better and faster results on Google.


That switch happened some years ago. I've been unlearning and relearning how to use google for what feels like at least three or four years now.

The main pain-point, though, is that a lot of long-tail searches you could've used to find different results in years past, now seem to funnel you to the same set of results based on your apparent intent. At least, it has felt that way -- I'm not entirely sure how the modern google algorithm works.


I realized this a few years ago when I observed my wife find things faster on Google than me.

I appreciate that it is easier for newcomers but I still hate it personally after years and especially that they cannot even avoid meddling with my queries even when I try to accept the new system and use the verbatim option.


Try your old Google Fu skills on DuckDuckGo (or Bing I guess). I've found it to have good results anyway


> I think you need to approach it more like grep than google. It's a forgotten art

A search engine that accepted regex as the search parameter would be amazing.

I actually used this method as a field filter for a bunch of simple internal tools to search for info. Originally people were asking for individual search capabilities, but I didn't want it to become a giant project with me as the implementor of everyone's unique search capability feature request - so I just gave them regex, encoded inputs into the URL query string so they can save searches - gave em a bunch of examples to get going and now people are slowly learning regex and coming up with their own "new features" :P

But this made sense because it's a relatively small amount of data, so small that it's searched in the front end, which is why it's more of a filter... I don't think pure regex would scale when used as a query on a massive DB; it would still need some kind of hierarchy to only bother parsing a subset of relevant text... unless there is some clever functional regex caching algorithm that can be used.
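
A minimal sketch of that kind of regex filter with a saveable URL, written in Python rather than as front-end code. The `filter` parameter name, the base URL and the sample rows are invented for the example; only `re` and `urllib.parse` from the standard library are used.

```python
import re
from urllib.parse import parse_qs, urlencode, urlparse


def save_search(base_url: str, pattern: str) -> str:
    """Encode a regex into the query string so the search can be bookmarked."""
    return f"{base_url}?{urlencode({'filter': pattern})}"


def load_search(url: str) -> str:
    """Recover the regex from a saved URL."""
    return parse_qs(urlparse(url).query)["filter"][0]


def apply_filter(rows: list[str], pattern: str) -> list[str]:
    """Keep the rows the regex matches anywhere (case-insensitive)."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [row for row in rows if rx.search(row)]


rows = ["web-01.prod", "db-01.prod", "WEB-7.prod.eu", "web-x.prod"]
url = save_search("https://tools.example/hosts", r"^web-\d+\.prod")
print(apply_filter(rows, load_search(url)))  # ['web-01.prod', 'WEB-7.prod.eu']
```

This is a linear scan over every row, which is exactly why it only works as a filter on small data sets.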


So, you are re-implementing Altavista, Lycos and other old search engines.

They used the naive approach: you searched for "steak", and they would bring the pages which included the word "steak".

The problem is that people could fool these engines by adding a long sequence like "steak, steak, steak, steak, steak, steak" to their site -- to pretend that they were the most authoritative page about steaks.

Google's big innovation was to count the referrers -- how many pages used the word "steak" to link to that particular page.

The rest is history.
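
The difference between the two approaches, as described in this comment, can be shown with a toy Python example. It is a deliberately simplified reading of the comment, not PageRank or any engine's real algorithm; the pages, links and function names are made up, and ignoring self-links is an extra assumption.

```python
# Toy corpus: page -> text. One page is stuffed with the keyword.
pages = {
    "spam.example": "steak steak steak steak steak steak",
    "grill.example": "how to cook a steak on a charcoal grill",
}

# Toy link graph: (source page, anchor text, target page).
links = [
    ("blog-a.example", "great steak guide", "grill.example"),
    ("blog-b.example", "steak tips", "grill.example"),
    ("spam.example", "steak", "spam.example"),  # a page linking to itself
]


def naive_score(text: str, term: str) -> int:
    """Old-style ranking: count how often the term appears on the page."""
    return text.lower().split().count(term)


def referrer_score(links, term: str, target: str) -> int:
    """Count other pages that link to `target` using the term in the anchor text."""
    return sum(
        1
        for source, anchor, dest in links
        if dest == target and source != target and term in anchor.lower().split()
    )


for page, text in pages.items():
    print(page, naive_score(text, "steak"), referrer_score(links, "steak", page))
# spam.example 6 0
# grill.example 1 2
```

Keyword stuffing wins under the first score and loses under the second, which is the point the comment is making. Link counting has its own failure mode, of course, once people start manufacturing links.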


Effective Google search is also history.

I understand they are trying to maximize ad revenue and search does work very well for people who are looking for products or services.

But it no longer works well for finding information that is even slightly obscure.


> The problem is that people could fool these engines by adding a long sequence like "steak, steak, steak, steak, steak, steak" to their site -- to pretend that they were the most authoritative page about steaks.

I don't see a lot of people investing in SEO to boost their Marginalia results.


> Google's big innovation was to count the referrers -- how many pages used the word "steak" to link to that particular page.

Then people fooled Google into showing the White House as top result when searching for "a miserable failure".

At the moment marginalia's approach of sorting pages into quality buckets based on lack of JS seems to be working extremely well, but of course it will be gamed if it gets popular.

However, I'd rather want SEO-crafting to concern itself with minimizing JS, rather than spamming links into every comment field on every blog across the globe ;-)


Love it, kudos! This is great for developers and others who Just Need Answers and not shopping or entertainment.

If you're looking for feedback, both from a UI design and utility standpoint, you might consider "inlining" results from selected sites, e.g. Wikipedia, Stack Exchange, etc. Having worked on search for a long time, inlining (onebox etc) is a big reason users choose Google, and that challengers fail to get traction. If you're Serious(tm), dig into the publisher structured formats and format those, create a test suite, etc.

A word of caution: if this takes off, as a business it's vulnerable to Google shifting its algorithms slightly to identify the segment of users+queries who prefer these results and give the same results to those queries.

Hope this helps!


If Google starts showing interesting text-heavy links instead of vapid listicles and storefronts, I have accomplished everything I ever could dream of.


Google Info - for when you're looking for information, not shopping advice or lists!


Maybe you're joking, but this is a good idea for search engine. Better: Credible info.


Google info? Can you give me a sample query of what you mean?


It was a joke. The joke being that Google should launch a new product called Google Info, that would actually give you information when you search.


Haha, reminds me exactly of this.

https://xkcd.com/810/


You can check the Web Vitals score of Google SERP-s using Core SERP Vitals (https://chrome.google.com/webstore/detail/core-serp-vitals/o...) and filter out the worst results.


Thank you for doing this important work.


haha, great answer! thanks for your work on this :)


Very cool project! How many websites do you have in your index? And how did you go about building it?

I've been working on an engine for personal websites, currently trying to build a classifier to extract them from commoncrawl, if you have any general tips on that kind of project they'd be very welcome.


About 21 million, and I'm crawling myself.

Classification is really hard. I'm struggling with it myself, as a lot of things like privacy policies and change logs turn out to share the shape of a page of text.

I'm thinking of experimenting with ML classifiers, as I do have reasonably good ways of extracting custom datasets. Finding change logs and privacy policies is easy, excluding them is hard.


If you're open to sharing your index I could make a classifier for you, I do this for a living. It's more the indexing and search engine part which has been a problem for me. That's why I'm working from commoncrawl.


This is awesome. I've been looking for a long time for a search engine that basically takes everything Google does and does the opposite. Thank you for doing this, I will definitely be bookmarking it.

Is there a way to suggest or add sites? I went looking for woodgears.ca and only got one result. I also think my personal blog would be a good candidate for being indexed here but I couldn't find any results for it.


I would also love to add some sites that I found missing...

Unfortunately this doesn't seem to be a feature which new search engines are focusing on - Brave Search also misses that feature...


This has amazing potential. I'd encourage you to form a non-profit, turn this into something that can last as an organization without becoming what you're trying to avoid becoming. This is a good enough start that I bet you could raise a sizeable startup fund very soon from a combination of crowdfunding and foundation grants—I bet the Sloan Foundation would love this!


This is absolutely wonderful. I am LOVING the results I'm getting back from it: the sort of content-rich sites that have become nigh unreachable using traditional search engines. Thank you for building this!


I love this, and I love (many of) the results so far! What I can't find on the site is detail about what "too many modern web design features" means. Is it just penalizing sites with tons of JavaScript?


Javascript tags are penalized the hardest, but it also takes into consideration density of text per HTML. There's also some adjustments based on text length, which words occur in the page, etc.
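
To illustrate the general idea (and only the idea: the actual weights and factors are not given here), a toy score in Python could count script tags and measure visible text per byte of HTML. The `PageStats` class, the `quality` function and the 0.25 penalty are all made up for this sketch; it uses the standard library's `html.parser`.

```python
from html.parser import HTMLParser


class PageStats(HTMLParser):
    """Collects the number of <script> tags and the amount of visible text."""

    def __init__(self):
        super().__init__()
        self.script_tags = 0
        self.text_chars = 0
        self._hidden_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in ("script", "style"):
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._hidden_depth:
            self._hidden_depth -= 1

    def handle_data(self, data):
        # Script and style bodies are not text a reader sees, so skip them.
        if not self._hidden_depth:
            self.text_chars += len(data.strip())


def quality(html: str, script_penalty: float = 0.25) -> float:
    """Text density (visible chars per char of HTML), divided down per script tag.

    The penalty value is arbitrary and purely illustrative.
    """
    stats = PageStats()
    stats.feed(html)
    stats.close()
    density = stats.text_chars / max(len(html), 1)
    return density / (1 + script_penalty * stats.script_tags)


texty = "<html><body><p>" + "word " * 50 + "</p></body></html>"
scripty = ("<html><body><script>var x=1;</script>"
           "<script src='a.js'></script><p>hi</p></body></html>")
print(quality(texty) > quality(scripty))  # True
```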


Which software do you use to index the sites?


I wrote it myself from scratch. I have some metadata in mariadb, but the index is bespoke.

A design sketch of the index is that it uses one file with sorted URL IDs, and one with IDs of N-grams (i.e. words and word-pairs) referring to ranges in the URL file, as well as a dictionary for relating words to word-IDs; that's a GNU Trove hash map I modified to use memory-mapped data instead of directly allocated arrays.

So when you search for two words, it translates them into IDs using the special hash map, goes to the words file and finds the least common of the words; starts with that.

Then it goes to the words file and looks up the URL range of the first word.

Then it goes to the words file and looks up the URL range of the second word.

Then it goes through the less common word's range and does a binary search for each of those in the range of the more common word.

Then it grabs the first N results, and translates them into URLs (through mariadb); and that's your search result.

I'm skipping over a few steps, but that's the very crudest of outlines.
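
For readers who want to see the shape of that lookup, here is a minimal Python sketch. It is not the engine's code: the in-memory dict stands in for the on-disk words and URL files, and `index`, `contains` and `search` are hypothetical names. It keeps the same strategy, though: start from the least common word and binary-search each of its URL IDs in the other word's sorted range.

```python
from bisect import bisect_left

# Hypothetical toy index: word -> sorted list of URL IDs containing it.
# In the design described above these would be ranges in a file on disk.
index = {
    "steak": [2, 5, 8, 13, 21, 34],
    "recipe": [1, 2, 3, 5, 8, 9, 13, 20, 21, 30, 34, 40],
}


def contains(sorted_ids: list[int], url_id: int) -> bool:
    """Binary search for url_id in a sorted list of IDs."""
    i = bisect_left(sorted_ids, url_id)
    return i < len(sorted_ids) and sorted_ids[i] == url_id


def search(index: dict[str, list[int]], words: list[str], limit: int = 10) -> list[int]:
    """Return up to `limit` URL IDs whose documents contain every word."""
    ranges = [index.get(word, []) for word in words]
    if not ranges:
        return []
    # Walk the shortest range (least common word) and probe the others.
    ranges.sort(key=len)
    rarest, others = ranges[0], ranges[1:]
    hits = []
    for url_id in rarest:
        if all(contains(r, url_id) for r in others):
            hits.append(url_id)
            if len(hits) == limit:
                break
    return hits


print(search(index, ["recipe", "steak"]))           # [2, 5, 8, 13, 21, 34]
print(search(index, ["recipe", "steak"], limit=2))  # [2, 5]
```

Walking the rarer word keeps the cost at roughly (size of the small range) x log(size of the large range), which is why picking the least common word first matters. Translating the resulting IDs back into URLs is the database step and is left out here.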


Good stuff. I've also been toying with doing some homegrown search engine indexing (as an exercise in scalable systems), and this is a fantastic result and great inspiration.

Definitely want to see more people doing that kind of low-level work instead of falling back to either 'use elasticsearch' or 'you can't, you're not google'.


Well just crunching the numbers should indicate what is possible and what isn't.

For the moment I have just south of 20 million URLs indexed.

1 x 20 million bytes = 20 MB.

10 x 20 million bytes = 200 MB.

100 x 20 million bytes = 2 GB.

1,000 x 20 million bytes = 20 GB.

10,000 x 20 million bytes = 200 GB.

100,000 x 20 million bytes = 2 TB.

1,000,000 x 20 million bytes = 20 TB.

This is still within what consumer hardware can deal with. It's getting expensive, but you don't need a datacenter to store 20 TB worth of data.

How many bytes do you need, per document, for an index? Do you need 1 MB of data to store index information about a page that, in terms of text alone, is perhaps 10 KB?
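
The same back-of-envelope numbers as a small script, assuming decimal units (1 MB = 1,000,000 bytes) and rounding the URL count to 20 million. The helper names `index_size_bytes` and `human` are made up for this sketch.

```python
URLS = 20_000_000  # "just south of 20 million", rounded up


def index_size_bytes(bytes_per_doc, urls=URLS):
    """Total index size for a given per-document byte budget."""
    return bytes_per_doc * urls


def human(n):
    """Format a byte count using decimal units."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n < 1000 or unit == "TB":
            return f"{n:g} {unit}"
        n /= 1000


for budget in (1, 10, 100, 1_000, 10_000, 100_000, 1_000_000):
    print(f"{budget:>9,} bytes/doc -> {human(index_size_bytes(budget))}")
```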


What crawler are you using and what kind of crawling speeds are you achieving?

How do you rank the results (is it based on content only) or do you have external factors too?

What is your personal preferred search option of the 7 and why?

Thanks for making something unique and sorry that despite all the hype this got, you got only $39/month on Patreon. It is telling in a way.


> What crawler are you using and what kind of crawling speeds are you achieving?

Custom crawler, and I seem to get around 100 documents per second at best, maybe closer to 50 on average. Depends a bit on how many crawl-worthy websites it finds, and there are definitely diminishing returns as it goes deeper.

> How do you rank the results (is it based on content only) or do you have external factors too?

I rank based on a pretty large number of factors, incoming links weighted by the "textiness" of the source domain, and similarity to the query.

> What is your personal preferred search option of the 7 and why?

I honestly use Google for a lot. My search engine isn't meant as a replacement, but a complement.

> Thanks for making something unique and sorry that despite all the hype this got, you got only $39/month on Patreon. It is telling in a way.

Are you kidding? I think the Patreon is a resounding success! I'm still a bit stunned. I've gotten more support and praise, not just in terms of money but also emails and comments here than I could have ever dreamed possible.

And this is just the start, too. I only recently got the search engine working this well. I have no doubt it can get much better. The fact that I have 11 people with me on that journey, even if they "just" pay my power bill, that's amazing.

I'm honestly a bit at a loss for words.


You have a great attitude!

And I am not kidding. I think for something that got so much attention on HN, where realistically this kind of product can only exist for now, the 'conversion' rate was very low. Billion dollar companies were made of HN threads with a lot less engagement. Makes me wonder: do we really want a search engine like this, or do we just like the idea of it?

And what are the barriers to use something like this? You say yourself that you are using Google most of the time. Is jumping to check results on this engine going to be too much friction for most uses?

Can something like this exist in isolation? What kind of value would it need to provide for users to remember using it en-masse as an additional/primary vertical search like they do for Amazon?

Just thinking out-loud as I am also interested in the space (through http://teclis.com).


I think in part it may just be because I'm not trying to found a start-up, and I'm not trying to get rich quick. If I were, I would have dealt with this very differently. My integrity and the integrity of this project is far more important than my bank balance. Not everyone feels that way, and I can respect that, but I do.

Ultimately I think running something like this for profit would create really unhealthy incentives to make my search engine worse. Any value it brings, right now, it brings because it isn't trying to cater to every taste and every use case.

I also hate the constant "don't forget to slap the like and subscribe buttons"-shout outs of modern social media, even though I'm aware they are extremely effective. If I went down that route, I would become part of the problem I'm trying to help cure. I do feel the sirens' call though, it's intoxicating getting this sort of praise and attention.

I want this to be a long-term project, not some overnight Cinderella story.

In the end, my search engine is never going to replace google. It isn't trying to, it's trying to complement it. It's barely able now, but hopefully I can make it much better in the months and years to come.


I think it's good not to have to depend on financial compensation for every single thing in your life, if you can be comfortable or do well otherwise.

This allows quite a bit of its own kind of freedom even if maximum financial opportunity is not fully exploited. Perhaps even because you are not grasping for every dollar on the table at all times.

You can do things without having to know if they will pay off, and if it turns out big anyway you can make money as a byproduct of what you do rather than having pure financial pursuit be the root of every goal.


I agree with everything you say. The 'subscribe and like buttons' would not help your conversion with HN readers, on the contrary. Trying to run this for profit also would not help your conversion with this audience.

So given your setup is already ideal for 'conversions' for this population (low profile, high integrity, no BS) I was simply genuinely surprised that only 11 people converted given the enormous visibility/interest this thread had. Hope that makes sense.


I think it simply takes time to build trust. The threshold to sending someone money is high. I probably wouldn't send someone money based on a proof of concept and lofty ambitions alone.

I'd absolutely consider sending someone money if they kept bringing something of value into my life. If I want more people to join the patreon, I'll just have to earn their trust and support.


The day Google first appeared on the full internet it was excellent of course because it had no ads.

Plus another excellent feature was you would get the same search results no matter who or where you were for quite some period of calendar time.

If something new did appear it was likely to be one of the new sites that was popping up all the time and it was likely to be as worthwhile as its established associates on the front page.

You shouldn't need to crawl nearly as fast if you can compensate by treading more suitably where those have gone before.


Interesting, in my database (http://root.rupy.se) I have one file per word that contains the ids (long) of the nodes (URLs), so to search many words together I have to go through the first file and one by one see if I find matches in the second.

How does the range binary search work, does it just prune out the overlaps, how efficient is it and how much data do you have in there for say "hello" and "world" f.ex?


I’m not sure how you go from word to url range? Range implies contiguous, but how can you make that happen for a bunch of words without keeping track of a list of urls for each word (or URL ids, the idea is the same)?


The trick is that the list of URLs for each word already is in the URLs file.

The URLs in a range are sorted. A sorted list (or list-range) forms an implicit set-like data structure, where you can do binary searches to test for existence.

Consider a words file with two words, "hello" and "world", corresponding to the ranges (0,3), (3,6). The URLs file contains URLs 1, 5, 7, 2, 5, 8.

The first range corresponds to the URLs 1, 5, 7; and the second 2, 5, 8.

If you search for hello world, it will first pick a range, the range for "hello", let's say (1,5,7); and then do binary searches in the second range -- the range corresponding to "world" -- (2,5,8) to find the overlap.

This seems like it would be very slow, but since you can trivially find the size of the ranges, it's possible to always do them in an order of increasing range-sizes. 10 x log(100000) is a lot smaller than 100000 x log(10).
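
To put numbers on that last comparison, here is a quick estimate, assuming one binary search costs about log2(n) comparisons. `probe_cost` is a made-up helper for this sketch.

```python
import math


def probe_cost(outer, inner):
    """Approximate comparisons: one binary search in `inner` per item in `outer`."""
    return outer * math.log2(inner)


small, large = 10, 100_000
cheap = probe_cost(small, large)    # walk the short range, search the long one
costly = probe_cost(large, small)   # walk the long range, search the short one

print(f"short range first: ~{cheap:.0f} comparisons")
print(f"long range first:  ~{costly:.0f} comparisons")
```

Under that assumption the two orders differ by roughly a factor of 2000, which is why starting from the smallest range matters.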


Hm, ok I understand more but how do you perform the "binary search", just loop over the URL ids?

Funny I also selected "hello" and "world" above! Xo

My system is also written in Java btw!

Here are example results of my word search:

http://root.rupy.se/node/data/word/four

http://root.rupy.se/node/data/word/only

etc.


I'll get back to your email in a while, I've got a ton of messages I'm working through.

But yeah, in short pseudocode:

  for url in range-for-"hello":
    if binary-search (range-for-"world", url):
      yield url

I do use streams, but that is the bare essence of it.


So every time you insert a new URL for a word you have to update the range for every other single word since the URL file will be shifted?


Are the n-grams always at most n=2 bigrams?


No, I actually count the n-grams as distinct words (up to 4-grams). The main limiter for that is space, so I only extract "canned" n-grams from some tags.

I would first search for the bigram hello_world, that's an O(1) array lookup; and then for documents merely containing the words hello and world (usually not a good search result), that's the algorithm I'm describing in the parent comment.
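
A small sketch of that two-stage lookup, assuming n-grams are stored as ordinary terms joined by an underscore. The `postings` dict is a hypothetical stand-in for the real index (a hash lookup here rather than the array lookup mentioned above), and the data is invented.

```python
from bisect import bisect_left

# Hypothetical postings: term -> sorted list of URL IDs.
postings = {
    "hello": [1, 5, 7, 9],
    "world": [2, 5, 8, 9],
    "hello_world": [9],
}


def intersect(a, b):
    """IDs present in both sorted lists, probing the longer one by binary search."""
    if len(a) > len(b):
        a, b = b, a
    out = []
    for url_id in a:
        i = bisect_left(b, url_id)
        if i < len(b) and b[i] == url_id:
            out.append(url_id)
    return out


def search(words):
    if not words:
        return []
    # Stage 1: documents indexed under the exact n-gram come first.
    phrase_hits = postings.get("_".join(words), [])
    # Stage 2: documents that merely contain every word.
    loose = postings.get(words[0], [])
    for w in words[1:]:
        loose = intersect(loose, postings.get(w, []))
    seen = set(phrase_hits)
    return phrase_hits + [u for u in loose if u not in seen]
```

Here `search(["hello", "world"])` puts the bigram match ahead of the documents that only contain both words somewhere.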


Makes sense. Every time you insert a new URL for a word you have to update the ranges for every other word since the URL file will be shifted?


It's a great project!


I continue to want a communal set of 50-100 million urls and data that are "good" (i.e. for any value of good and more complete than common crawl) that are accessible enough to be easy to work with that can be used to experiment with different web search tech. There are enough separate problems to tackle in web search that breaking it down would maybe move the needle. We have lots of kaggle competitions about ranking, but using closed data. What other types of kaggle projects would help web search?

What can we do to foster a sustainable bazaar of projects to make it easier to build web search engines?


How are you doing the crawling without getting blocked? -- the hardest part.


Not OP but crawling is easy if you don't try scanning 5+ pages a second - almost all rate limiting/heuristic based 'keep server costs low' engines, including Cloudflare, don't care if you request every page, but will take action if you do something like burst every page and take up just as many server resources as a hundred concurrent users.

Now, that is assuming you aren't on some VPS provider. If you're going to crawl, you'll have the best chance when you use your own IPs on your own ASN, with DNS and reverse DNS set up correctly. This makes it so the IP reputation systems can detect you as a crawler but not one that hammers every site it visits.

Also, I imagine that, for a search engine like this, it doesn't expect content to change much anyways - so it can take its time crawling every site only once every month or two, instead of the multiple times a week (or day) search engines like Google have to for the constantly-updated content being churned out.
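
A minimal sketch of the "don't burst" part, assuming a fixed minimum delay between requests to the same host. `PoliteScheduler` and the one-second default are made up for illustration; the actual fetching and robots.txt handling are left out.

```python
import time
from urllib.parse import urlsplit


class PoliteScheduler:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay  # seconds between hits on one host (assumed)
        self._next_ok = {}          # host -> earliest allowed fetch time

    def wait_time(self, url, now=None):
        """Seconds to wait before fetching url; 0.0 if it can go right away."""
        now = time.monotonic() if now is None else now
        host = urlsplit(url).hostname
        return max(0.0, self._next_ok.get(host, now) - now)

    def record_fetch(self, url, now=None):
        """Call after each request so later ones to the same host are spaced out."""
        now = time.monotonic() if now is None else now
        host = urlsplit(url).hostname
        self._next_ok[host] = now + self.min_delay
```

A crawler loop would check `wait_time(url)` and, while one host is cooling down, move on to URLs from other hosts.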


Pretty neat!!!

You may already be aware of this, but the page doesn't seem to be formatted correctly on mobile. The content shows in a single thin column in the middle.


Hmm, which OS? I only have a single Android phone so I've only fixed the CSS for that.


I was seeing it on Android w/ Firefox. Seems like it's fixed now though. :)


Curious, I haven't touched the stylesheets.


For example Firefox on Android.


Fennec F-Droid on Android 11 has some rendering issues.


Really like it, and love that you've done this yourself.

I'd prefer if it does just one thing and does that really well. Don't waste your time on calculator and conversion functions, or pseudo-natural language queries. There are plenty of good calculator & converter tools and websites, but we all need a really good search engine. I think you'd be better looking at handling synonyms and plurals.

Thanks.


I've mostly added these converters and frills because I'm trying to eat my own dogfood, and google is deeply engrained as my go-to source for unit conversions and back-of-envelope calculations.

Don't worry, this stuff is easy, and doesn't even remotely take away from the work on the harder problems.


Awesome project! How are you able to keep the site running after HN kiss of death? What is your stack, elastic search or something simpler? How did you crawl so many websites for a project this size? Did you use any APIs like duck duck go or data from other search engines? Are you still incorporating something like PageRank to ensure good results are prioritized or is it just the text-based-ness factor?


> How are you able to keep the site running after HN kiss of death?

I originally targeted a Raspberry Pi4-cluster. It was only able to deal with about 200k pages at that stage, but it did shape the design in a way that makes very thrifty use of the available hardware.

My day job is also developing this sort of high-performance Java application, I guess it helps.

> What is your stack, elastic search or something simpler?

It's a custom index engine I built for this. I do use mariadb for some ancillary data and to support the crawler, but it's only doing trivial queries.

> How did you crawl so many websites for a project this size?

It's not that hard. Like it seems like it would be, and there certainly is an insane number of edge cases, but if you just keep tinkering you can easily crawl dozens of pages per second even on modest hardware (of course distributed across different domains).

> Did you use any APIs like duck duck go or data from other search engines?

Nope, it's all me.

> Are you still incorporating something like PageRank to ensure good results are prioritized or is it just the text-based-ness factor?

I'm using a somewhat convoluted algorithm that takes into consideration the text-based-ness of the page, but also how many incoming links the domain has, but it's a weighted value that factors in the text-based-ness of the origin domains.
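
As a very rough illustration of the weighted-links idea only, here is a sketch in which each incoming link counts for as much as the text-based-ness of the domain it comes from. The real algorithm is described as more convoluted; the formula, the name `domain_score` and the sample domains are all assumptions.

```python
def domain_score(domain, textiness, inbound):
    """Own text-based-ness plus inbound links weighted by the linker's text-based-ness."""
    link_weight = sum(textiness.get(src, 0.0) for src in inbound.get(domain, ()))
    return textiness.get(domain, 0.0) + link_weight


# Hypothetical data: text-based-ness per domain and who links to whom.
textiness = {"blog.example": 0.9, "wiki.example": 0.8, "ads.example": 0.1}
inbound = {
    "wiki.example": ["blog.example"],   # linked from a text-heavy site
    "ads.example": ["ads.example"],     # only linked from a low-text site
}

ranked = sorted(textiness, key=lambda d: domain_score(d, textiness, inbound), reverse=True)
```

In this toy data a link from a text-heavy blog lifts the target more than a link from an ad-heavy page does.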

It would be interesting to try a page rank-style approach, but my thinking is that because it's the algorithm, it's also the algorithm everyone is trying to game.


Thank you so much for creating such a useful search engine!

Is there any way that you can get an HTTP certificate?

I use an old iPhone 4S, and most of the modern web is inaccessible due to TLS. Hacker News and mbasic.facebook are two of the last sites I can use.

Usually text-based sites are more accessible, so this could be really useful to help me continue using my antique devices!


is there a json endpoint ? I'd love to make an emacs bridge :)


Seconded, I’d like to incorporate it into a project of mine.


Love the idea. A little feedback: layout needs tweaking for mobile. FWIW: I'm on mobile Firefox for Android.


I searched Warcraft and got a gold selling/ level boosting site. Some things never change :)


This is really cool! A spacer between the links would help my old eyes; I keep getting lost in which link goes with which description. :-)


Hi,

Interesting idea. Definitely see an overlap with eReader markets and looking at text only contents.

How does it work?

Does it ignore pages on which it detects frameworks for UI and ads, or any JavaScript code at all?


I can't see the letters in the disturbing white search box, I'm on duckduckgo, brave, monocles, jquarks, smartcookieweb on Android.


Nice, what are you using to crawl the web?


It's pretty much all bespoke.

I use external libraries for parsing HTML (JSoup) and robots.txt; but that's about it.


What was the starting site you fed to the crawler to follow the links from to build the index?


Just my (swedish) personal website. The first iteration of the search engine was probably mainly seeded by these links:

https://www.marginalia.nu/00-l%C3%A4nkar/

But I've since expanded my websites, so now I think these play a decent role in later iterations, although they are virtually all of them pages I've found eating my own dogfood:

https://memex.marginalia.nu/links/fragments-old-web.gmi

https://memex.marginalia.nu/links/bookmarks.gmi


Nice project, but have you heard of FrogFind? It also presents lightweight search results.


This is a very cool project! Thank you.


What's the tech stack?


It's custom java code, for the most part. I'm using mariadb for some ancillary information, but the index and the crawler and everything is written from scratch.

It's hosted on my consumer-equipment server (Ryzen 3900x, 128 GB RAM, Optane 900p + a few IronWolf drives), bare bones on Debian.


I love this idea, and admire the work you put into it. I'm a fan of long reads and historical non-fiction, and Google's results are truly garbage.

I have a criticism that I think may pertain to the ranking methodology. I searched for "discovery of Australia". Among the top results were:

* A site claiming that the biblical flood was caused by Earth colliding with a comet (with several other pages from that site also making the top search results with other wild claims, e.g. that the Egyptians discovered Arizona);

* Another site claiming the first inhabitants of Australia were a lost tribe of Israel;

* A third site claiming that Australia was discovered and founded by members of a secret society of Rosicrucians who had infiltrated the Dutch East India Company and planned to build an Australian utopia...

These were all pages heavy with HTML4 tags and virtually devoid of Javascript, the kinds of pages you'd frequently see in the late 1990s from people who had built their own static websites in a text editor, or exported HTML from MS Word. At that time, there were millions of those sites with people paying for their own unique domain names, and so the proportion of them that were home to wild-eyed conspiracy theories was relatively small. What I think has happened is that kooks continued to keep these sites up - to the point where it's almost a visual trope now to see a red <h1> tag in Times New Roman and think, uh oh, I've stumbled on an "ancient aliens" site. Whereas scholars and journals offering higher quality information have moved to more modern platforms that rely more heavily on modern browsers - with or without their own domain names. So as a result what seemed to surface here were the fragments of the old web that remain live - possibly because people living in cabins in Montana forget to cancel their web hosting, or because the nature of old-school conspiracy theorists is to just keep packing their old sites with walls of text surrounded by <p> tags.

Arguably, this seems to rank the way Google's engine used to, since it couldn't run JS and they wanted to punish sites that used code to change markup at render time. At least, when I used to have to do onsite SEO work, it was always about simple tag hierarchies.

I wonder whether there isn't some better metric of validity and information quality than what markup is used. Some of the sites that surfaced further down could be considered interesting and valuable resources. I think not punishing simple wall-of-text content is a good thing. But to punish more complicated layouts may have the perverse effect of downranking higher-quality sources of information - i.e. people and organizations who can afford to build a decent website, or who care to migrate to a modern blogging platform.


those three pages sound pretty interesting, I don't see this as a problem


I don’t want my search engine to somehow try to judge the believability of the results. I’d like to be the judge of that myself.


Great work. Working on an alternative search engine too. Take a look at my profile.


fantastic project, thank you!


I tried a few searches.

<<javascript pipe syntax>>: none of the search results appeared to have anything to do with Javascript pipe syntax. (Which doesn't exist yet, but it's under discussion.) Google gives a bunch of highly-relevant results.

<<hans reichenbach relativity>>: first result is a list of books about relativity, one of which is Reichenbach's "Philosophy of space and time"; good, but there's no real information there. Second is about Reichenbach but nothing to do with relativity or even, really, philosophy of science. Third is about philosophy of science and mentions some of Reichenbach's work but not related to relativity. Fourth mentions Reichenbach's "Philosophy of space and time" as part of a list of books relevant to a seminar on "time and eternity". None of this is bad, but it's not great either. Google gives a couple of online philosophy encyclopaedia entries, then a journal article on "Hans Reichenbach's relativity of geometry", then the Wikipedia article on Reichenbach ... much more informative.

<<luna lovegood actress>>: I thought this would be an easy one. It was easy for Google, which gave me her name in large friendly letters at the top, then her IMDB entry, and a bunch of other relevant things. Literally nothing in the Marginalia results was relevant to the query.

I guess maybe popular culture is just too monetizable, so no one is going to write about it on the sites that Marginalia crawls? Let's try some slightly less popular culture.

<<wilde "a handbag">>: First result is kinda-relevant but weird: it's about a musical adaptation of The Importance of Being Earnest. It doesn't mention that famous line from the play, but one of the numbers in the musical has the words "a handbag" in the title. Second result is a review of a CD of musicals, including the same work. Third is a bunch of short reviews of theatrical items from the Buxton Festival Fringe, one of which is a three-man adaptation of TIOBE. Next four are 100% irrelevant. Next is a list of names of plays. Last one is actually relevant; it's an article about "Lady Bracknell through the decades". Google puts that one first (after, sigh, a bunch of YouTube videos which look as if they might actually be relevant).

I really like the idea of this, and many of the things it turns up look like they might be interesting, but it isn't doing very well at producing results that are actually relevant to the thing being searched for.


TIL about https://github.com/tc39/proposal-pipeline-operator, which I am immediately looking forward to playing with once it gains traction Some Time From Now™

(I have no earnest reason to transpile)


Yeah, this seems pretty nice. I don't think the "deep nesting" issue is quite so realistic... I very rarely have a logic tree that's easier to identify by its leaves than its root. And I'd really hate to have code where you have to scroll to the end of a bunch of pipes to figure out what they're adding up to

But I have plenty of single use "temp variables" and cutting those out could be cool.


To be fair, those searches are pretty weird.


We don't search for things because they're easy to find


I mean most of my searches are probably pretty easy to find, I just don't want to go to the website I'm thinking of and click through 5 pages to get there.


Uh that has nothing to do with it? I don't immediately know the URL of a website which shows me info about lamps?


The pop culture one is fairly common. Me and my wife both search "who the fuck is that" in that TV show movie all the time. Or who is the author of X book?


I think perhaps the usefulness here is less finding what you're looking for, but rather finding something interesting.


It’s trying to surface long articles and you’re asking it for a one word answer. What did you expect? A long article consisting of “Emma Stone played Cruella” repeated 800 times?


They seem reasonable to me.


i don't know, people here like to complain about google, but google still works pretty well for me.


I was going to write a response, but this comment I found further down the page sort of communicates my perspective: https://news.ycombinator.com/item?id=28552078


Fascinating. I studied an "obscure" group of insects. My go-to search term to test an engine is their family name as it is a rarely used word and I know most (all?) of the major data sources that have accumulated data on it. When Wolfram Alpha added species names, I checked with the name, boring, Duck Duck, boring, Google (well we know Google isn't for search anymore, it's absolutely horrible) boring, Bing, boring... you get the idea.

This was a little different, extremely few results, but a couple of them really made me grin, and all(?) made me curious or raise an eyebrow or reflect on who/what might have been the source of the link, or remember some obscure connection from grad-school. So, if anything a crawled list of results worthy of ponder, thanks for this!


Likewise, I co-maintain the only "fan" site on one of my all-time favourite composers/performers, and gave the engine a shot with a unique string query. While my text-heavy WP-driven site didn't seem to make the cut, the results were highly relevant in that they were links to former band members and collaborators - a couple of which I didn't realize existed. That being said, there were a few sites (including my own) I expected to be returned, but no dice. Still, a fascinating experiment that many at HN have been clamouring for.


The search engine doesn't actually do full text search, so maybe your query was too... unique.

But do first of all verify that you haven't been hacked. There are about a quarter of a million domains I've flagged that, besides their WordPress content, also host a ton of link spam crap off in some hidden folder. This reflects on the quality rating extremely negatively, to the point where you may not have been indexed at all.

Secondly, are you behind cloudflare or some other big-name CDN? Because, as I mentioned in another comment, I can't crawl their pages without getting captchad until they approve of my humble request to be classified as a good bot.

There are some other hosting providers I flat out block on a subnet level because they host a large amount of link farms. This is currently Alibaba, Psychz, eSited, Cloud Yuqu and 1Blu.
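
Subnet-level blocking of this kind can be sketched with Python's standard `ipaddress` module. The networks below are reserved documentation ranges used as placeholders, not the providers' real address blocks, and `is_blocked` is a made-up helper.

```python
import ipaddress

# Placeholder networks (RFC 5737 documentation ranges), not real provider subnets.
BLOCKED_SUBNETS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(ip):
    """True if the resolved address of a host falls inside a blocked subnet."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_SUBNETS)
```

A crawler would resolve the hostname first and skip the fetch when `is_blocked` returns true.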


It’d be nice if you had a page to get the current index status for a domain.


Try a query of the form site:www.example.com ;-)


Would it be possible to have a link to a page with operators?


> site:www.washingtonpost.com

> Blacklisted false

> site:www.wsj.com

> Blacklisted false

> site:www.rt.com

> Blacklisted false

> site:www.nytimes.com

> Blacklisted true

?


Hmm, not sure what caused it to end up there, but I removed it from the blacklist. It still doesn't seem to want to index the domain however, probably CDN-related.


Thanks for the advice; not hacked, but I have "resurrected" many WP sites that have been (including my wife's non-profit). Just running on an EC2 micro instance, but I tried adding "site:" and received "No such domain". Actually, I think it's because I haven't enabled "HTTPS" yet! That's on my to-do along with migrating off EC2-Classic to VPC...


Vanilla HTTP should be fine. I think 80% of the urls are HTTP.

If you're getting no such domain, it's either blocked because it looks too much like a spam domain, or it simply hasn't been discovered yet.

What's the TLD? I severely restrict some cheaper TLDs because they gave so much spam.

For example, cr.yp.to is an example of a baby I know I've definitely thrown out with the bathwater.


Is a good ol' .com with no ads and minimal JS - originally launched in 2011. Thanks again for your insights; I've bookmarked your site and will check back every so often to see if my site's been indexed.


www.ft.com gets 'no such domain'


I added it now, but it turns out it's behind a CDN so I still can't crawl it.


Thanks for responding and especially thanks for the search engine. What a breath of fresh air, and access, it feels like, to real people.


Yeah, that's a large part of what I'm trying to accomplish. Great to see others understand as well.


Exactly this. A couple of results returned references to obscure now-defunct newsletters and clubs, people that I know were historically important for past researchers, but only because this was my research focus for so long would I have known this.


I'm intrigued by this experiment but I can't visualize it. What do you mean by boring results? Would combing through a library (the one with paper books) also produce boring results? What's your ideal results?


Perhaps a counter example, something that is interesting. Anecdotally. This, of all things, is the top result in my search: https://tft.brainiac.com/archive/0303/msg00037.html. Which is strange to me because I don't recognize tft.brainiac. I click, it's a list of biological relationships among Hymenoptera, including a reference to a genus of the wasps I studied, presumably in a biological relationship (host/parasite) context. I cataloged every relationship known at one point, so my brain wants to know where this came from, is it something I caught. Then I go look for more context, and find it's part of a thread about D&D(?) and hymenoptera, and it's epic, and a chunk of my morning is lost figuring out why and how this came to be.


Yes, thanks. That helps.

If I understand it correctly, you're interested in bits and pieces of new information that's indirectly related to your object of interest. Degree 2 and 3 in Six Degrees of Kevin Bacon, so to speak. You know degree 0 like the back of your hand and you've seen almost everything closely connected. Finding novel, interesting things is getting more difficult.

Have you thought about cataloging all the related stuff you stumble upon? Something in between loose notes and what Moby Dick is to cetology.


Exactly.

> Have you thought about cataloging all the related stuff you stumble upon? Something in between loose notes and what Moby Dick is to cetology.

Tongue in cheek- new app time, to facilitate this. It should have the name "Degree4". Entries can only be made if degrees 2 and 3 are "defined". Scoffs at degrees 5 and 6, just because. Startup developing can probably unethically seed content by mining https://www.everything2.com/. Should use concepts of "AI" and "persistent homology"... profit!

But no, I don't outside a mental note. Closest I would come would be adding '!! <some note>' to my potwiki text notes (see my past comments) if its something I want to have come back with a grep, or think might be interesting to explore "when I retire". If it's a scientific fact in my field after researching it further it would go into this https://taxonworks.org (or its precursor).


In part, by boring results I mean I instantly recognize the top results, and I know exactly what will be in them, and I know which ones will actually contain potentially interesting new stuff, i.e. _I didn't have to search for these, I'd go there directly_. Then the next results are all obscure, and I've already visited them, and/or I know they are historical and not something I have to revisit.

With this engine with at least 1/2 the links (to be fair there were < 20) I didn't recognize the URL at all, and it was clear in the text or the URL that there was an interesting bit to check out (i.e. what Google should have also returned after they barfed out the things I don't need to know about), but had never succinctly done in my experience.

I suppose the magic in this engine would have to be alerting the searcher that they found more of this type of link, as once I visited the 10 or so sites they would fall back into the "been there, done that" link category that Google appends somewhere after the ads and "big" sites, mixed in with a million search term spam sites, etc.


There’s certain grey literature that’s not captured in university library federated searches nor easily found with mainstream search engines.


There are decades of academic research not digitized. The digitization window used to only hit around 1990, I haven't looked at it hard recently, but I suspect this still remains true for many important journals. This is grey only to those who do not know how to use a library.


> well we know Google isn't for search anymore

Do you suggest anything better? As far as I can tell, all the other search engines are either repackaged Bing (ie: DuckDuck), or are just as bad.


Not sure, but I remember when Google could find literally anything. Then they started adding a bunch of exceptions and crapped out their quality. I wonder how different the search results would be if we could get the older Google engine from the 2000s back.

I now have to play games with Google to find things. I feel like I do less than I used to for some reason.


The other day, I was searching for something, and google's suggested, on-site answers took up 1/2 the first page. All wrong.

The actual search results were another 1/4 page of completely identical results, followed by google ad placed search results.

I thought to myself, they've finally done it. Real responses are no longer first page.

A large part of why Google got crappy is "OK Google", another form of the "all platforms are the same" sickness.

No, a desktop is not a phone. No, voice searching is not the same as phone, or desktop.


I was just thinking that they finally became Lycos. It's what all the search engines except Google looked like back in the early 2000s - ad laden cesspools of irrelevant search results and other content. And it's why we all switched to Google at the time.


It's time to disrupt the market, as Google can't compete with a newcomer that penalizes ads on the page.


Seriously, yes.

Moore's law means a modern-day 2007-style Google should be significantly less expensive to run now than back then.

Also the most relevant patents are now free to use.

2021 Google is a sad story compared to 2007 Google and I'd actually pay to get back 2007 Google - ads included - meaning a double revenue source :-)


Wacky idea: instead of Google changing its algorithm every couple of years, it could run 50 algorithms in parallel, leaving no way for sites to "optimise" for the current one.


The output of the parallelism is itself an algorithm, that can and will be optimized for.


You're absolutely correct, and a lot came from their nerfing of search modifiers like + - "search term" and whatnot. There's also a lot of ads and "PSA" type nonsense. If I'm looking for anything COVID related for example, I have to sift through a heap of PSA nonsense that's not even related to my search query.


This needs an update but is an easy look see. https://www.searchenginemap.com/

Broad and longer Twitter lists maintained here: https://twitter.com/SearchEngineMap/lists


Mojeek was built in the same spirit (one server living in a house) and has 4.5 bn pages indexed now, and a bunch more servers. A lot of people comment in similar style of it reminding them of an older Internet, or generally less branded results. It's definitely an alternative point of view. Disclaimer: I work for them.


IMHO, that is a trendy claim on HN with little evidence.


You want evidence? Search for a plumber/tradesperson in your area THEN try to find rational discourse about your options. There are literally 100s of results of websites remixing a small set of data, presenting it to you, and asking you to buy something to see more, when you know there is nothing behind the scenes.

This type of engine would punish these sites, in theory, and may turn up a discussion in some forum, newsgroup, etc. that is actually relevant, or insightful.


> Search for a plumber/tradesperson in your area THEN try to find rational discourse about your options.

I searched "plumber Austin TX" in Google and got a map and list of company websites near me. There are a lot of "top x y in z" list sites, but the top results were still the most relevant. I don't know what "rational discourse" I'm expected to find, though, or why I should assume the discourse I would find through Google is less rational than discourse I would find elsewhere.

I searched the same thing in OP and found nothing even remotely significant. Not even anything related to plumbing.

OP's project isn't optimized for relevance, it's optimized for nostalgia - providing a filter that keeps the modern web away and dropping quirky, interesting breadcrumbs to distract you and remind you of what it was like to wander around the web of the 90's.

Which is all well and good if that's what you want, and judging from the comments it is what a lot of people here want, but Google giving me a list of company names, numbers, websites and a map showing their location by distance is more useful, even if it uses "modern web design" and javascript.


> I searched "plumber Austin TX" in Google and got a map and list of company websites near me.

I think you could have done this historically in a Yellow Pages phone book. My OP used "boring". A list of plumbers is boring; it's been done on dead wood. I'm not saying boring == !useful.

> There are a lot of "top x y in z" list sites

This is an understatement. I actually want to know the top x in y; to do that I need "rational discourse". Rational discourse is recognizable as well written, insightful, humble, reflective, self-countering, anecdotal etc. By "search is terrible" I mean with respect to finding this.

> OP's project isn't optimized for relevance, it's optimized for nostalgia

Nostalgia is highly relevant if it's on topic, but I agree with you as to what this engine is about.


>Rational discourse is recognizable as well written, insightful, humble, reflective, self-countering, anecdotal etc. By "search is terrible" I mean with respect to finding this.

I believe a search engine that ignores results based on superficial and aesthetic qualities like "modern web design" would be even worse in that regard, unless you're assuming no relevant discourse about any subject has taken place on the web since the early 2000's.

I admit, I have no idea what heuristic you would actually use to find "well written, insightful, humble, reflective, self-countering, anecdotal etc" content, but I've seen it on modern sites (even on Twitter,) and I've seen a lot of garbage on old sites, so a simple text search of only old websites doesn't seem like it.

It is fun, though.


> I believe a search engine that ignores results based on superficial and aesthetic qualities like "modern web design" would be even worse in that regard

I think you may be confused. No one's trying to replace Google here. The idea is to have another option for when Google craps out on you. And if you find that Google almost always craps out on you, then hey, maybe you're in the 5–10% who have just found their new default search engine :-)


Well, it displays a map of plumbers in my area; is that not useful? Besides, do you remember what it was displaying before "it became useless"? This whole thread is full of hand-wavy claims with pretty much no good examples of how Google actually became worse over time. Hence my point.


You're downvoted but in my experience I have never really been burned by this Google-decline


"I haven't seen a black swan, ergo it's not real."

I've been burned by this decline in the past.

From creepy results (e.g. the first suggestion, before I had typed anything, was something I had said near my Android phone and had never searched for) to no longer finding things I used to find successfully, Google has started declining.


It would be nice if you provided a few real examples so that we would see how Google was so fantastic and found everything magically but then went to shit.


Read this comment from up the page: https://news.ycombinator.com/item?id=28552078


I read it. It doesn't offer any proof, nor is it a good example at all.


You are a lobster. (or frog, depending upon parable)


Google Search is a fantastic product because it's essentially Spotlight for the web. It's by far the fastest way to get to things you already vaguely know are there and acts as a metasearch for large sites.

But as a result it's now less useful as a tool for scouring the web.


> well we know Google isn't for search anymore,

If you're talking about the ads, I would bear in mind Google's whole business model is basically online advertising. Search is just the vehicle to deliver those ads; I'd say Google is pretty good at throwing things back.


But what’s their UVP? I’d say quick and relevant search results. And that seems to be constantly degrading.


Well, the unique value proposition is their gigantic index, really fast search and a bunch of other things.

I'm not sure about the quality of the results, I just use DuckDuckGo these days, but IMO the technical advancements are pretty unique to Google.


> I just use DuckDuckGo these days

Then w.r.t. quality of results, you're just using Bing: https://help.duckduckgo.com/duckduckgo-help-pages/results/so...


I like it.

Coincidentally, the other day I was daydreaming about a search engine that favors sites that are updated less frequently. The thought being, the kinds of labors of love that characterized the 1990s Web that I still sometimes miss are still out there, it's just harder to find them amidst the flood of SEO dreck. So perhaps they could be made discoverable again with the help of a contrarian search engine that specifically looks for the kinds of things that Google and Bing don't like to see.


Similarly, I wish there was a recommendation engine (for web, music, movies, whatever) that can show you what is the furthest away from your existing tastes. I've learned to re-create my Spotify account once every 6 months or so, as their recommendation engine becomes a boring machine after using it daily for some months.

I'd love to discover new content that is different from what I read/watch/listen to now, but it's really hard to know about genres you don't know about.


It's hard, though. I simultaneously want something far from my tastes, but I don't want to see Plandemic-style Ivermectin material, or Focus On The Family-style material. I want things that will push me out of my comfort zone sometimes, but it turns out I really don't want the thing furthest from my tastes; I want things marginally adjacent. I want them close enough to feel familiarity, but far enough that it challenges my worldview.

I don't think a recommendation engine can do that.


Doing that takes real work and curiosity. I'm afraid an algorithm will never be able to do it, particularly if you're into niche stuff. For instance, I really enjoy a Japanese band called The Boredoms, but few people like them, and only 2 of their albums are available on Spotify.


Million Short [1] offers an option to omit results from popular domains. It's a different approach from what you describe, but I think the goal is similar.

[1] https://millionshort.com/


I think I like this method the most, honestly. There's something to be said for minimalist web design, but there's also something to be said for everything we've learned since the 90s.


Try this https://wiby.me/

Clicking 'Surprise me' gave me an interesting article from 1994 http://milk.com/wall-o-shame/bucket.html


That was a great read, thanks for sharing.


I had this problem recently trying to fix an Atari. There's a guy out there who has tons of guides on doing video-out mods, but the newer guides reference the older ones. However, googling the OG guide didn't find it, so I manually scoured his old web page.


I like the idea. Tangentially, I wonder how one would find the right 'penalty' for more updated sites?


This is stunning. I searched "winemaking" because it's my latest obsession, and turned up dozens of links to high-quality pages I'd never seen despite spending an hour a day for three months cruising Google on the topic.

Please do announce it here if you ever decide to solicit help or contributors. My stab at this problem was to have a search index of only ad-free pages, on the hypothesis it would turn up self-hosted blogs, university personal pages, that sort of thing. But the results were too thin, your approach is much better.


Or if he wants investors. I'd invest a few hundred k in this.


Add an OpenSearch file: https://developer.mozilla.org/en-US/docs/Web/OpenSearch

Maybe something like:

  <OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/"
                         xmlns:moz="http://www.mozilla.org/2006/browser/search/">
    <ShortName>Marginalia</ShortName>
    <Description>Marginalia Search - a search engine that favors text-heavy sites and punishes modern web design</Description>
    <InputEncoding>UTF-8</InputEncoding>
    <Image width="16" height="16" type="image/x-icon">https://search.marginalia.nu/favicon.ico</Image>
    <Url type="text/html" template="https://search.marginalia.nu/search?query={searchTerms}" />
    <moz:SearchForm>https://search.marginalia.nu/</moz:SearchForm>
  </OpenSearchDescription>
and then add:

  <link rel="search"
      type="application/opensearchdescription+xml"
      title="Marginalia"
      href="/opensearch.xml" />
to your page's <head>.


It's been added a few moments ago :-)

Thanks everyone who suggested this.


This is a search engine indexing the internet on a mariadb database hosted on consumer hardware maintained by a single person as a hobby and it does not suffer from HN hug of death


Huh.

I've gone through the logs, and at most I got about 2 search queries per second as a sustained load over several hours. It's calmed down to about 0.5-1 QPS now.

This is ignoring other page loads. Not bad, if I do say so myself.


It's a testament to how bad other websites are.


I do wonder how much cookies play into it.

Sure I have small page loads and reasonably optimized code, but I also don't use any cookies. I'd imagine having to keep track of upwards of a million session objects would be a struggle for almost any web server.


how on earth do you index so much on consumer hardware? my frontend developer mind is blown.


Wait till you learn that modern CPUs run billions of cycles per second. With multiple cores in parallel! And they can reach transfer rates of tens of gigabytes per second to RAM, or around a terabyte per second into L3.


And then you add a single HTTP request and everything tones down to the speed of the web. Or I/O. Or DB call.


More like: you add some javascript library and now the browser needs 5 seconds to run 10 MB of javascript.


You can still support millions of these requests per second if you just bake all of the dependencies directly in a small OS running on your fastest raspberry pi.


Consumer hardware today is simply what was cutting-edge and crazy expensive 5 years ago.


Wow this is immediately useful

If you figure out some sort of funding model (maybe even just Patreon) I could totally see this as a viable side project

Already discovered this recipe site: https://based.cooking/

I love how adding recipes is through pull requests: https://github.com/LukeSmithxyz/based.cooking/pulls


Thank you for this, it really makes me love the web and the people making things like this, Forked!


I've added a patreon, and a few people have hopped onboard. I'm honestly a bit stunned at the reception this has gotten.


As everything in life flows in cycles, I predict the search engine that will dethrone Google will be like Google when it started: a simple variation of PageRank.

No smarts, no bubble, no signals decided by overfitting to a biased engineer's preferences.


I wouldn't say the existence of this page proves your prediction right (as it's not dethroning Google anytime soon).

It's easy to forget that the goal of Google isn't to provide a useful search engine (at least not anymore); the search engine is a by-product of them wanting to show ads.


If Google isn't useful then nobody will use it.

The search engine and the ads are tightly coupled. A better search engine means it can predict with more accuracy what you are looking for and can serve you an even more targeted ad that increases the chance you'll click.


> If Google isn't useful then nobody will use it.

Or they'll continue to use it out of sheer inertia. Google is paying Apple $15 billion to keep its place as iOS default search engine.

IE6 didn't die overnight when Firefox arrived.


Now Google tries to imitate a document search system on your computer; usually I rely on Google knowing what I need :(


Wouldn't the dethroner of Google be some new technology which is not a search engine like Google but better at solving the original task of finding information on how to solve problems?

Just like how it was the iPad, not the Mac, that dethroned Windows PCs for the average home user: Windows had the monopoly, and then an innovation rather than a direct competitor destroyed MS in this space.

I don't think a "Google dethrones Yahoo and AltaVista" scenario will occur again.


> iPad dethroned Windows PCs for average home user

Is this true? In the US, perhaps? Because in South America it couldn't be further from the truth; it didn't happen at all.


As a dev I would love a search engine that only searches Stack Overflow, GitHub issues, documentation, etc.


You can limit the search query per website in DDG (and probably in others)

Example: `rust slow compilation site:stackoverflow.com`


Yeah but usually I want some set of sites not just stackoverflow.


...especially if you could group online resources by category (e.g. software eng, cooking, ...)


I agree, except it'll optionally accept the ID of your node in a web of trust, and it'll use a page rank customized for you.

Or you can put in two ID's and have it find sources that both parties trust.


I liked this one... I searched for 'George Harrison' and among the first results there was a page with interesting comments about Harrison's solo career; someone reminiscing about the time they got to talk to him about guitars for half an hour at a bar at the airport; a transcript of an interview he gave on TV... Whereas on Google: an intrusive 'People also ask' box in which I was not interested; thumbnails for videos on YouTube that I was not looking for; previews of garbage clickbaity news articles; and then finally for the search items: a bunch of websites for lyrics; his Instagram (!) and FB pages; his IMDb page; some more news articles I was not looking for...

Granted, google's web results above are perhaps what people are looking for 75% of the time, but how limiting and boring.

I'm also a sucker for the simplistic text-centric, information-laden pages from the pre-facebook era.

For 'global warming', however - since Marginalia excludes modern web-design pages - the results are of dubious relevance and interest, since they are, well, 'old'.

I see myself using this engine a lot.


This has been needed in my life for a while. I am growing really apathetic about the internet lately, but I realize that is because my entry point is always a google search.

I miss finding blog posts and scholarly articles in long form. I hate the SEO sites with unreadable UI because the information in them is often a lot lower quality as well.


I think the small web and gemini protocol might be up your alley too, if you feel this way about the Internet.


Oh, I dream of a day where there are multiple useful search engines, specialized for different purposes.

You're doing God's work here. Thanks and good luck.


I wonder if there's any mileage in an extension of something like uBlock Origin's lists of ad networks to block but instead it's a list of known content mills and SEO spam factories to remove from search results?


Kind of reminds me of the past, like AltaVista and Dogpile.


Are there any technical details about the search engine? Found some information here https://memex.marginalia.nu/projects/edge/about.gmi

Author said they threw it together on consumer hardware. How big is the index? (TB used or entries) how is it realised?

I'm pretty much interested in this since I myself am crawling some pages for my own "search index".

Oh and thx for making and posting. Added it as a keyword to firefox

Edit: Just realized that my question is a bit shallow. What I'm particularly interested in is the storage before the indexing. I'm trying to store the raw HTML so that I can reindex everything with better algorithms, but I'm hitting many limits. It takes a few minutes to get the size of a site directory (every site has its own dir), I'm at a point where I can't reasonably manage the scrape versioning over git, and I cycled through a few filesystems only to find that the metadata management kind of sucks for most of them. It's rather interesting how we store such files, and I'm thinking about storing a few sites in a simple SQLite format for easy access and search. I'm also thinking about a few low-overhead solutions like Facebook's Haystack project (implemented open source in SeaweedFS) or something similar... Hopefully this gives some context to the question of storage and sites that are indexed.
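
For what it's worth, the SQLite idea could look roughly like the sketch below: one table holding every fetched version of a page, with the body compressed. The table name `page`, its columns and the helper names are all made up for illustration, and zlib is just one possible choice of compression.

```python
import sqlite3
import time
import zlib


def open_store(path=":memory:"):
    """Open (or create) a page store; ':memory:' keeps the demo off disk."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS page (
               url        TEXT    NOT NULL,
               fetched_at INTEGER NOT NULL,
               body       BLOB    NOT NULL,
               PRIMARY KEY (url, fetched_at)
           )"""
    )
    return db


def save_page(db, url, html, fetched_at=None):
    """Store one crawl of a URL; older versions are kept, not overwritten."""
    ts = int(time.time()) if fetched_at is None else fetched_at
    blob = zlib.compress(html.encode("utf-8"))
    db.execute("INSERT INTO page VALUES (?, ?, ?)", (url, ts, blob))
    db.commit()


def latest_page(db, url):
    """Return the most recently fetched HTML for a URL, or None."""
    row = db.execute(
        "SELECT body FROM page WHERE url = ? ORDER BY fetched_at DESC LIMIT 1",
        (url,),
    ).fetchone()
    return zlib.decompress(row[0]).decode("utf-8") if row else None
```

Versioning then becomes a matter of querying by `fetched_at` rather than walking a git history or a directory tree, and the size of a site is one aggregate query instead of a filesystem scan.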


The index is tiny, not even a terabyte. Right now it's a few hundred gigabytes for ~20 million URLs. But it's stored in an extremely dense binary format.

Honestly you may just want to roll your own solution for storing a ton of files. If you don't need a general-purpose filesystem, but an append-only archive with extra metadata, then you can cut a lot of corners. Like if you have a file system that is fixed-size and append-only, you can build it in a way no off-the-shelf stuff can.

This line of thinking is a large part of why my index is so small and fast. I have a lot of special built data-structures that are built for their exact use case. Like a fixed size append-only hash map that uses mapped memory and can in theory be larger than the system memory. Very good for a search engine, absolutely useless almost everywhere else.
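
To illustrate the kind of structure meant here, below is a minimal sketch of a fixed-size hash map over a memory-mapped file. It is not the engine's actual code: the class name, the 16-byte slot layout, the hash multiplier and the use of linear probing are all assumptions made for the example. Because nothing is ever deleted, probing needs no tombstones, and the OS pages the file in and out, so the table can be larger than RAM.

```python
import mmap
import os
import struct
import tempfile

SLOT = struct.Struct("<QQ")  # (key, value) as two u64; key 0 marks an empty slot


class MappedHashMap:
    """Fixed-size open-addressing map backed by a memory-mapped file (no deletes)."""

    def __init__(self, path, n_slots):
        self.n_slots = n_slots
        size = n_slots * SLOT.size
        with open(path, "wb") as f:
            f.truncate(size)  # zero-filled file, so every slot starts empty
        self._file = open(path, "r+b")
        self._map = mmap.mmap(self._file.fileno(), size)

    def _probe(self, key):
        # Multiplicative hash, then linear probing over the whole table.
        i = ((key * 0x9E3779B97F4A7C15) & 0xFFFFFFFFFFFFFFFF) % self.n_slots
        for _ in range(self.n_slots):
            yield i
            i = (i + 1) % self.n_slots

    def put(self, key, value):
        if key == 0:
            raise ValueError("key 0 is reserved for empty slots")
        for i in self._probe(key):
            k, _ = SLOT.unpack_from(self._map, i * SLOT.size)
            if k == 0 or k == key:
                SLOT.pack_into(self._map, i * SLOT.size, key, value)
                return
        raise MemoryError("map is full")

    def get(self, key, default=None):
        for i in self._probe(key):
            k, v = SLOT.unpack_from(self._map, i * SLOT.size)
            if k == key:
                return v
            if k == 0:
                return default
        return default

    def close(self):
        self._map.close()
        self._file.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        m = MappedHashMap(os.path.join(d, "demo.map"), n_slots=1024)
        m.put(12345, 678)
        print(m.get(12345))
        m.close()
```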


This is really cool, it filters out all fluff.

It's not always taking me to totally relevant sites but the results contain my favourite type of content.

Full of writing and pure html - usually the hallmark of someone who knows what they are doing, wants to communicate but doesn't want to waste their time.


This is really refreshing work, and we can all benefit from other search engines focused on improving the field. I tried a bunch of searches and some of them were quite wonderful, others were a little dry on results. But overall I enjoyed going through it. Here are some critiques if you don't mind:

I did search for "Daria Bilodid" and the results were a bit troublesome. First the Wikipedia result did not work: https://en.wikipedia.org/wiki/Daria_Bilodid vs https://encyclopedia.marginalia.nu/wiki/Daria_Bilodid

Secondly the results matched a few judoinside.com results which is ok, including sites to her competitors, but seemed to miss the judoinside website for her: https://www.judoinside.com/judoka/92660/Daria_Bilodid.

The design is hard on my eyes. I have an average-size screen and it's using less than half of the width. The line-height is enormous and seems to break up the flow, making it uncomfortable for me to read. The spacing around each result is the same as between titles and paragraph items, which again was unpleasant to read.


> Secondly the results matched a few judoinside.com results which is ok, including sites to her competitors, but seemed to miss the judoinside website for her: https://www.judoinside.com/judoka/92660/Daria_Bilodid.

The title says this search engine punishes modern websites (images, videos, MB of JS, I suppose), and this site looks scarce on text and heavy on images, maybe that's confusing the ranking.

I certainly find the results very refreshing, but you'll have to complement with other search engines if they're not enough. In fact, I think the days when we could use a single search engine have already passed.


I love this! I've been searching random words with no aim in particular and keep finding lots of interesting tiny personal webpages. It feels like the old web


Gotta say, sometimes the results really are nice. I searched for "Land Cruiser 70." The first result is a simple, short blog post about a couple who traveled across Europe and Asia in their Troop Carrier (http://www.destoop.com/trip/1%20PREPARATION/2%20Vehicle%20sp...).

The first results on Google are an Australian site for buying an LC70 (news flash, I can't buy one in the USA). There is also a MotorTrend article about the LC70... also irrelevant since it's only sold in Australia.


What we need now is a search engine that weeds out sites that have been SEO optimized for keyword density.

I’m tired of searching for “generic keyword” and getting a page with an extremely low signal to noise ratio written like this:

“Many people search for generic keyword. That is why you can find all about generic keyword here. In fact we specialize in generic keyword and slight alterations of generic keyword.”

It’s like Google stopped caring that people were gaming it.


After a few experiments, I'd compare this to panning for gold in a river others ignore because the yield isn't good. You will find nuggets you would have otherwise missed, but you'll work for it. This engine is likely best used to supplement other options.

What set Google apart from its competitors was its use of eigenvector-based weights. I don't sense a robust weighting system in use here.


Too bad the search index is currently restricted to ASCII-only (or at least Cyrillic and Latin-2 characters were rejected as "contains characters that are not currently supported").

I love the idea definitely, and I've long toyed around with building a similar thing that starts crawling off my own bookmarks (a personal small-deep-web if you wish).

I also love the "Small Web" name: this is the first I hear of it, and it's what I've long complained about — the web today hides all of the cool gems search engines of old would have given you!

I am also a bit split on the "www" prefix restriction (iiuc, domains which do not also have a "www" subdomain are dropped from the index because many of them are spammy): it might for sure be a useful heuristic, but I advocated dropping "www" back in the late 90s and early 2000s already (one reason being that e.g. in Serbian, "w" is not in the alphabet, so you can't reasonably spell it out, as Serbian is otherwise a phonetic language).


Fast and doesn't crash when on the front page of Hacker News!


Well,... yet. Load average is at 1.2, not that bad. But the services are getting a solid workout.


The real test is now: the index server is reconstructing its index. It does this every 6 hours if there are new pages. Takes half an hour or so usually.

It's supposed to be able to handle searches at the same time, but jeepers, it's gonna have to chew through nearly 400 Gb of data while dealing with over 1 request per second.


Is your site/code on GitHub? I would be happy to give performance tips/tweaks. Also FYI, https://marginalia.nu/ gives a certificate error (I know that's not the search site).


That crossed my mind too, considering vanilla webpages sometimes struggle with a top page HN thread, never mind a search engine backend.


Fantastic idea and it works quite well for short phrases that I tried.

As expected I am getting a lot of early 2000s sites which is something that I miss on regular Google.

Hilariously, searching for "array data structure" got me this tiny little page as one of the top results: http://infolab.stanford.edu/~backrub/google.html


> We have designed Google to be scalable in the near term to a goal of 100 million web pages

Funny, that's about where I see my search engine capping out as well.


See also: https://wiby.me/


The "surprise me..." button is adequately labelled.


Great link to drag to the bookmark bar.


Shades of stumbleupon.


I've got some results where the same site has more than 70% of the links. It was a very on-topic and high-quality site, but still, all the results shouldn't point to the same place.

I think some grouping by site (and capping to only the few most relevant links there) would improve the engine.


It shouldn't give you more than three per domain on paper, but I've got issues with deduplication so sometimes you get six :-(

Still a lot of work needs to be done.
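
As an illustration of the capping being discussed (not the engine's actual code), here is a small sketch that drops near-duplicate URLs and keeps at most three results per domain. The function name and the normalization rules (ignore the scheme, a leading "www." and a trailing slash) are assumptions; real deduplication would need to be smarter than this.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def cap_per_domain(ranked_urls, limit=3):
    """Keep ranking order, drop duplicate URLs, allow at most `limit` per host."""
    per_host = defaultdict(int)
    seen = set()
    kept = []
    for url in ranked_urls:
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        # Treat http/https, www/non-www and trailing-slash variants as one page.
        key = (host, parts.path.rstrip("/"), parts.query)
        if key in seen or per_host[host] >= limit:
            continue
        seen.add(key)
        per_host[host] += 1
        kept.append(url)
    return kept
```

Skipping duplicates before counting matters: if the same page is counted under two spellings of its URL, each spelling eats into the cap separately, which is one way a "three per domain" rule can end up showing more.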


This is a really cool idea. I tried a few technical queries I did on DDG today and didn't get amazing results - hence the warning in the About page about this engine giving you things you didn't know you were looking for, rather than specific facts. But the examples others have posted sound promising and refreshing. I would love to read about the algorithms behind this and how modern web design gets detected in order to punish it...


Yeah I think you typically have the most interesting results searching for older topics.

I've posted a few comments outlining the tech, check my profile. Some details are also outlined in my blog if you dig around a bit; there's no real comprehensive overview though, just bits and pieces.

https://memex.marginalia.nu/topic/astrolabe.gmi


I tested this with "Caribbean Vacation" and wow what a difference. Everything on Google is "TOP X LIST" and "BEST XYZ" which are just the worst when trying to find real interesting information about experiences you can have on vacation somewhere. I had used those as starting points then searched for long-form blogs of real experiences people have had. This surfaced those kinds of things immediately. I love it.


Kudos for taking on this project, and I like the idea! I think it'll be a big project to take it to the next level, but would love to have a search engine that's more useful.

Some reactions:

- The font is really big and the columns really narrow, so I get 3 - 4 entries per page, something like 8 words per line, and huge spacings between lines, which makes it a frustrating experience. I've been using the recommendations in https://practicaltypography.com/, which recommends 60 - 90 characters in a line I think, and line spacing of 120% - 140% (I like 125%). The line lengths here might technically fall within the lower bound, but it's really short, and for search results I'm going to try scanning the text to see if there's something relevant, so I think going on the long side is better here. At least make the width somewhat variable so that I can shrink the rather large font and fit more on the line.

- The results are eclectic, but I'm not sure it's usable at the moment. "scala append list" did not get me much that's helpful, while Google will usually at least put up some click-farming tutorial that although minimal effort does tend to answer the question. Both "mapo doufu recipe" and "ma po do fu recipe" had very few recipes, although the latter did have one. Unfortunately, recipe websites are some of the worst, with about 10 pages of description, ads, pictures, what-have-you until the recipe at the very bottom. "collection unmitigated pedantry" did return the acoup.blog entry at the top, though.

Good luck on the project!



My pet peeve with search results is simply that there are ancient technical results that in many cases are irrelevant. If I am searching for a Windows error message, I don't want some old forum post from 2001, especially if it didn't have any answers!

What would be cool would be for people who host old stuff to "archive" it at some point so it doesn't appear in normal results, only if you tick "include archives".


As much as the release names for macOS over the years were marketing gimmicks, it does make it a lot easier to zero in on the correct version when doing these types of searches.


Where does the data come from? Do you index the whole web yourself? That seems totally impossible for a personal project. I'm very curious about that.


I do indeed index the web myself. Not the entire web, just a subset of it. The crawler quickly loses interest in javascript:y websites and only indexes at depth those websites that are simple. It also focuses on websites in English, Swedish and Latin and tries to identify and ignore the rest (best-effort).

You'd be surprised how much you can do with modern hardware if you are scrappy. The current index is about 17.7 million URLs. I've gone as far as 50 million and could probably double that if I really wanted to. The difficulty isn't having a small enough index, but rather having a relevant enough index, weeding out the link farms and stuff that just take space.

I only index N-grams of up to 4 words, carefully chosen to be useful. The search engine, right now, is backed by a 317 GB reverse index and a 5.2 GB dictionary.
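
To make the "N-grams of up to 4 words" idea concrete, here is a minimal sketch of enumerating word n-grams. It only shows the enumeration step; how the useful ones are "carefully chosen" is not described here, so no selection is modeled. The function name `word_ngrams` is a hypothetical helper, not something from the actual search engine.

```python
def word_ngrams(text, max_n=4):
    """Return every contiguous run of 1..max_n words, lowercased."""
    words = text.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        # Slide a window of n words across the text.
        for start in range(len(words) - n + 1):
            grams.append(" ".join(words[start:start + n]))
    return grams


# Hypothetical usage: a 5-word phrase yields 5 + 4 + 3 + 2 = 14 n-grams.
grams = word_ngrams("fall of the Roman empire")
```

A real index would presumably keep only a subset of these, since storing every n-gram of every page grows quickly.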


Amazing!

I have only one recommendation that might make the search a bit more relevant, e.g. when searching for 'linux locking' or 'kernel locking' kind of things.

Try to upsort things that match near the top of the content, like the top of the man page vs middle vs bottom.

One easy way to do it, without having to store the positions, is to index the n-grams with min(sqrt(line), 8) of their line number, rounded down. This covers the first 64 lines. You can also use log(), or just decide ad hoc on top, middle and bottom of the document, so you only need 3 values.

e.g. https://www.kernel.org/doc/html/v5.0/kernel-hacking/locking.... would do unreliable_1 guide_1 locking_1 ... then at line 4 kernel_2 locking_2 ... after line 50 ... then_7 ... and after that everything will be _8.

Then just rewrite the query "kernel locking" to "dismax(kernel_1 OR kernel_2 OR kernel_3 ...) AND dismax(locking_1 OR locking_2 ...)" with some tiebreaker of 0.1 or so. You can also say "I want to upsort things on the same line, or a few lines apart" by modifying the query a bit.

It works really well and costs very little in terms of space. I tried it at https://github.com/jackdoe/zr while searching all of Stack Overflow, man pages, etc., and was pretty surprised by the result.

This approach is a bit cheaper than storing the positions, because positions are (let's say) 4 bytes per term per doc, while this approach has a fixed upper-bound cost of 8*4 bytes per document (assuming 4-byte document IDs) plus some amortized cost for the terms.
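
The scheme above can be sketched in a few lines. This is a toy in-memory version under stated assumptions: the bucket is floor(sqrt(line)) capped at 8, the per-bucket weight `1 / bucket` is an arbitrary choice for illustration, and the dismax score is "best bucket weight plus tiebreaker times the rest". None of the names below come from zr or any real search library.

```python
import math
from collections import defaultdict

MAX_BUCKET = 8  # line 64 and everything after it share the last bucket


def line_bucket(line_number):
    """Map a 1-based line number to floor(sqrt(line)), capped at MAX_BUCKET."""
    return min(math.isqrt(line_number), MAX_BUCKET)


def index_document(index, doc_id, text):
    """Index each word as word_<bucket>, e.g. kernel_1, locking_2."""
    for line_number, line in enumerate(text.splitlines(), start=1):
        bucket = line_bucket(line_number)
        for word in line.lower().split():
            index[f"{word}_{bucket}"].add(doc_id)


def search(index, query, tie_breaker=0.1):
    """AND across query words; each word is a dismax over its buckets."""
    scores = None
    for word in query.lower().split():
        per_doc = defaultdict(list)
        for bucket in range(1, MAX_BUCKET + 1):
            for doc_id in index.get(f"{word}_{bucket}", ()):
                per_doc[doc_id].append(1.0 / bucket)  # assumed weight
        word_scores = {}
        for doc_id, weights in per_doc.items():
            best = max(weights)
            word_scores[doc_id] = best + tie_breaker * (sum(weights) - best)
        if scores is None:
            scores = word_scores
        else:
            # Keep only documents that matched every word so far.
            scores = {d: scores[d] + s
                      for d, s in word_scores.items() if d in scores}
    scores = scores or {}
    return sorted(scores, key=lambda d: scores[d], reverse=True)


# Hypothetical usage: the same phrase near the top vs. far down the page.
index = defaultdict(set)
index_document(index, "top", "kernel locking guide\nmore text")
index_document(index, "bottom", "\n".join(["filler"] * 70 + ["kernel locking"]))
ranking = search(index, "kernel locking")
```

Each word can land in at most 8 postings lists per document, which is where the fixed upper bound on cost comes from.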


This is unbelievably impressive on a technical and ambition level for a solo, self-hosted hardware project. Kudos.


Do you know what proportion of the texty web instructs unknown crawlers to go away (or blocks them)?


It's hard to give numbers, it doesn't seem to be very many, but losing out on a few key sites does make a pretty big impact.

You see stuff like this sometimes, makes me a bit sad.

https://linux.die.net/robots.txt


Cool, I've been thinking on this topic a bit lately. Crawling is indeed not that hard of a problem. Google could do it 23 years ago. The web is a bit bigger now of course but it's not that bad. Those numbers are well within the range of a very modest search cluster (pick your favorite technology; it shouldn't be challenging for any of them). 10x or 1000x would not matter a lot for this. Although it would raise your cost a little.

The hard problem is indeed separating the good stuff from the bad stuff; or rather labeling the stuff such that you can tell the difference at query time. PageRank was nice back in the day, until people figured out how to game things. And now we have bot farms filling the web with nonsense to drive political agendas, create memes, or to drown out criticism. PageRank is still a useful ranking signal; just not by itself.

The one thing no search engine has yet figured out is reputability of sources. Content isn't anonymous mostly. It's produced and consumed by people. And those people have reputations. Bot content is bad because it comes from sources without a credible reputation. Reputations are built over time and people value having them. What if we could value people's appreciation relative to their reputability? That could filter out a lot of nonsense. A simple like button + a flag button combined with verified domain ownership (ssl certificates) could do the trick. You like a lot of content that other people disliked, your reputation goes down the drain. If you produce a lot of content that people like, your reputation goes up. If a lot of reputable people flag your content, your reputation tanks.

The hard part is keeping the system fair and balanced. And reputability is of course a subjective notion and there is a danger of creating recommendation bubbles, politicizing certain topics, or even creating alternative reality type bubbles. It's basically what's happening. But it's mostly powered by search engines and social media that actually completely ignore reputability.


> The hard part is keeping the system fair and balanced.

It is, which is why I think the author should stay away from anything requiring users to vote on things.

The problem with deriving reputability from votes over time is in distinguishing legitimate votes from malicious votes. Voting is something that doesn't just get gamed, it gets gamed as a service. You'll have companies selling votes, and handling all the busywork necessary to game the bad vote detector.

Search engines and social media companies don't ignore this topic - on the contrary, they live by it. The problem of reputation vote quality is isomorphic to the problem of ad click quality. The "vote" is a click event on an ad, and the profitability for both the advertiser and the ad network depends on being able to tell legitimate clicks and fake clicks apart. Ludicrous amounts of money went into solving this problem, and the end result is... surveillance state. All this deep tracking on the web, it doesn't exist just - or even primarily - to target ads. It exists to determine whether a real would-be customer is looking at an ad, or if it's a bot farm (including protein bot farm, aka. people employed to click on ads en masse).

We need something better. Something that isn't as easy to game, and where mitigations don't come with such a high price for the society.


> It also focuses on websites in English, Swedish and Latin and tries to identify and ignore the rest

When I search for Japanese terms, it says "<query> needs to be a word", which wasn't the best error message. Maybe the error message should say something like "sorry, your language isn't supported yet"?


I've rephrased the wording for that one a bit.


How did you go about seeding your web crawler with URLs to crawl?


I just started with my website and did a crawl. Subsequently I've been seeding it with the best results from my previous crawls.

It's a directed search so it doesn't seem to need a particularly solid seed to get decent results.


So how long did it take to get to 17 million URLs?


Not OP, but if I was to do this, I'd start by downloading Wikipedia and all its external links and references, and crawling from there. You should eventually reach most of the publicly visible internet.


I feel a little embarrassed that I didn't think of something like that.

When I did some crawler experimenting in my younger years, I thought I was pretty clever using sites that would let you perform random Google searches. I would just crawl all the pages from the results returned.

Your method would undoubtedly be more interesting I think. It would certainly lead to interesting performance problems quicker, I bet.


Congratulations. Truly impressive search results. I tried two one-word searches. The results were interesting, useful, and would have been impossible (well, really really hard) to find on standard search engines. Plus, no garbage, ads, recommendations, etc. As another commenter suggested, it is what World Wide Web search results were like twenty years ago!


PS. I added Marginalia as a search option (even the default for now) in Firefox Nightly (on Android). In case others want to, under settings for search, you can add other, then name, and then: https://search.marginalia.nu/search?query=%s


After a good amount of searching it doesn't seem possible to add Marginalia as default search in firefox (84.0b8) on Debian.

I did not expect this to not be available.


I am running Linux Mint, and I did not expect adding a custom search engine to be missing from Firefox. There are plug-ins, but I don't like adding plugins.

https://mmphosis.netlify.app/search.marginalia.nu/


Open a search, right-click on the address bar. The author added the magic metadata needed, so Firefox should be able to pick it up and offer "add search engine" option in the context menu.

I know. It's an idiotic UX regression over being able to configure search engines yourself in the Settings screen. Takes control away from the user.


You should monetise this with amazon affiliate links that are relevant to each search. And then use that money to keep this project going. Google is fantastic, but it has become something different from what it was, the company and the product. It is so refreshing to see a modern tool that encourages exploration of the actual world wide web.


I might add a donate button or something if people want to help support the project, hardware isn't cheap and all. But I have a job and decent income. I think if this search engine became the way I earned money, it would influence the project in a bad way, and corrupt its purpose, which is to help people explore the less-commercial internet.


Appreciated. The more things fill up with monetizing shit, the more I stay away. There's something beautiful in having higher purposes than grubbing for cash.


+1 for a donate button; much preferred over affiliate links or ads of any kind. Thank you for making this beautiful little(?) product!


I'd donate to continued expansion/development of something like this. Where is somewhere good to follow you for any thoughts/updates?


I have something of a blog here, with an Atom feed.

https://memex.marginalia.nu/log/

It's not very well optimized for mobile, really it's more of a bridge for my geminispace content.


I added a patron, and huh, a few people are actually donating. I don't really know what to say. Thanks!


That would be an absolutely awful decision.


I searched for "Starlink satellites" and found this Y2K-style Canadian UFO blog [1] explaining it isn't aliens. I might just waste my weekend with this search engine.

[1] https://www.ufobc.ca/Reports/stringoflights.html


As a quick test, I searched for the name of one of my favorite game series: "Baldur's Gate" (on its own, no qualifiers, properly spelled - I would usually spell it "baldurs gate" on Google, but I decided to give this one the best chance). I search for info around video games a lot, so that's quite representative of a good chunk of my web searches, and I pretty much know the top sites Google would give me for that query (on its own, without any further qualifiers).

The results were all either barely relevant, outdated (sites that covered the game back in the 90s/2000s before it was re-released), at best tangentially relevant or complete garbage noise. Some of the most highly relevant pages (such as the Steam store listing, the fandom wiki, the publisher/developer's forums for the re-releases, the Baldur's Gate 3 website and the subreddit) were not included at all. Those are all fairly text heavy by any reasonable standard, so I assume they were "punished" because they use JS? Would make sense that nearly all of them are way out of date.

Then I searched specifically for "Baldur's Gate Wiki" but still out of luck - some results, but nothing vaguely Wiki-like.

Finally I searched for "Baldur's Gate Fandom Wiki". This is basically "search engine easy mode", by giving essentially the name of the site I am looking for. I got ZERO results. At this point I gave up and decided that this thing is useless.

Look, I'm all for unearthing good long-form content (in fact I would say that much of the content around this specific game would qualify), and I do get as annoyed at modern SPAs as the next grumpy neckbeard.

I think considering both of those in a search engine is not a bad idea in and of itself. But I have to wonder what's the point of a search engine that weights some arbitrary aspect of web design higher than the relevancy of the subject matter (to the point of not returning any results at all)? In fact, considering that generally speaking more recent websites tend to include more scripting, you are intentionally skewing the results towards (very) old content, which is probably doing the user a disservice.


This just isn't the place to go for promotional materials about upcoming video games. It's a niche search engine for discovering stuff off the beaten path, the stuff you can't find on mainstream search engines. Some of it is junk, admittedly, and not everyone will see the point, that's fine too.

Despite what some people seem to think, it's never been meant as a google-replacement. I have never claimed otherwise.


Fair enough.

I tried it out under the assumption that it was an attempt at a better mainstream search engine, but I guess I should have paid deeper attention to the name :-)


> But I have to wonder what's the point of a search engine that weights some arbitrary aspect of web design higher than the relevancy of the subject matter (to the point of not returning any results at all)?

Because in some cases it returns arguably better, more to-the-point results than other search engines. For example, search for “Douglas Engelbart” or “Ted Nelson”. I thought that I’d searched everything for those two, yet marginalia gave results I would otherwise never have seen.


This is a fantastic search engine. It delivers on its promise of "serendipity". I found pages featuring my name that I'm not sure I've ever seen before, after many years of searching myself to test out search engines.

Perhaps more importantly, it delivers the most correct result when searching for my username: the first result is not any of my social media accounts, or even my own blog, but the text of the obscure science fiction story that I took my username from! Well done.

I've immediately added this as a search keyword in Firefox, and I'll be using it more in the future.

Could meta search engines like DuckDuckGo include this as a source? Should they?


I did a quick check using the name of the Scottish village I am originally from (or as I should say "far am fae") and this produced a much more interesting set of links for me than that produced by Google


Great results for "sauna". Lots of Web 1.0 pages discussing building plans and displaying pictures of individually built, traditional, unique, old saunas on some property.

The Google results are all blogspam or sales pages for cheap shipped saunas. Lots of "IR" results. Phony health benefit pages. Stock photos solely of beautiful new hotel gyms.

I've noticed this problem with Google results for quite some time. Sadly, the new content being created of the top variety is mostly being done within private Facebook groups that can't be easily searched, linked, or archived.


- Semantic HTML; not everything is a div; correct use of markup.

- Search results are not overrun with commercial, SEO stuffing, "content" farms.

I don't know what to say. This is such a refreshing sight. Well done.


Absolute textgasm.

Wonder how 'text-only first' prioritization is being implemented, algorithmically speaking?


This post - https://news.ycombinator.com/item?id=28551183 - suggests it's a simple set of heuristics, looking for things like JavaScript, link/SEO spam, language, amount of text content, etc., filtering out unwanted results and only indexing wanted ones.


Thank you!


Wikipedia links point to <https://encyclopedia.marginalia.nu/> instead, which to my eyes is less readable. The justified text, done with CSS instead of the LaTeX algorithm, looks wild. The font used for quotations is even worse (very thin).

Wikipedia is perfectly usable without JavaScript and it's one of the nicest sites out there typography-wise, so I'd reconsider this redirection.


I guess it's a matter of taste. I can barely read anything on regular wikipedia because the inline links disrupt my flow.


I have an interest in logic and CS curricula, and I like geneses in general (in the last few days I've read Russell's Introduction to Mathematical Philosophy and an ACM report on CS curricula). I searched for "cs curriculum" and this is the first link: https://www.cs.rice.edu/~vardi/sigcse/ It feels so good to receive good answers so easily. Thanks.


A common use case, how to do random thing in programming:

I searched python make a bar chart and it returned a live coding video with an AI generated text transcript and two articles which mentioned a different kind of bar.

I then narrowed it down to just python bar chart, and got a blog post about scripting with a bar chart in it, this http://www.nitcentral.com/voyager4/hellyear.htm with monty python, bars, and charts from 1996 and among some other things I found this https://python-course.eu/naive_bayes_classifier_introduction..., which had an example of a python bar chart even though the title of the page made me think it wasn't what I wanted.

So for what I imagine to be a difficult search because of all the different meanings of the words, I found my result on the second query pretty quickly, and found some cool unrelated stuff too.

I like mostly that I get what I type in, and not exactly what I want, but what I want is there too.


I would probably use this if I wanted to find interesting blog posts/websites about a topic I want to learn more about in general. It seems less useful for returning exact answers to specific questions.


I use webcrawler.com, and IMO it's better than any other search engine for finding exactly what I'm looking for. Not what's "trending", or "popular", or what the sheeple are searching for. It finds the exact matching keywords that I'm looking for. No inference or other bullshit -- just the matches.

Such a relief to not wade through oceans of worthless crap any more.


Pretty cool. I am not sure yet how useful, but cool it is.

However, it seems that it currently does not support non-Latin alphabets, which I understand in an early version. Still, its handling of such "exception cases" could be improved:

when I search for a Russian word, say "Аквариум", I get <<Search "Аквариум" needs to be a word>>, which is rather rude...


"It also focuses on websites in English, Swedish and Latin and tries to identify and ignore the rest (best-effort)."

https://news.ycombinator.com/item?id=28551183


Fine; still "Not a supported language" would be much better response than "not a word".


I've also found that brave search gets much better results than google for some programming related topic, simply by not being targeted by blogspam SEO as much. It's refreshing to not have to click through 3 auto generated "articles" but to either a) get the documentation straight away or b) find actually human written blog entries.


I wish niche search engines had an option to group results by domain name. There are a few major sites that dominate Google search results with low-effort content. As long as Google stands as the largest search engine, it's unlikely that these major sites will want to rearchitect themselves into different domain names.
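
Grouping by domain is cheap to do on the client side of a result list. A minimal sketch, assuming results arrive as plain URL strings and that grouping by full hostname is good enough (it does not collapse subdomains into a registrable domain). `group_by_domain` is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def group_by_domain(urls):
    """Group result URLs by hostname, preserving first-seen order."""
    groups = {}
    for url in urls:
        host = urlsplit(url).hostname or ""
        groups.setdefault(host, []).append(url)
    return groups


# Hypothetical usage with URLs that appear elsewhere in this thread.
results = [
    "https://memex.marginalia.nu/log/",
    "https://www.cs.rice.edu/~vardi/sigcse/",
    "https://memex.marginalia.nu/projects/edge/about.gmi",
]
grouped = group_by_domain(results)
```

A search UI could then show one entry per host and fold the rest behind a "more from this site" link.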


This is incredible. I just got goosebumps as I stumbled upon https://solitaryroad.com after searching for "linear algebra homomorphism". It reminds me of the magical feelings of the early Internet. Keep up the great work!


You can add this to Firefox as search engine option by right clicking on the URL and selecting "Add Marginalia". From there, setting it as your default search engine is done from the "Settings" panel as with other predefined search engines.

I'm experimenting with using it as my default ...


The results are fantastic, but I can't see how the excerpts relate to the search term.

For example, a search term for 'Scotichronicon' returns some fascinating results, but the search term itself doesn't appear in the title or excerpts of most of the results.

This makes it harder to judge how relevant they are.


The excerpts are static and very best effort. You just have to visit the website and find out I'm afraid.

I can do a lot with what I have, but I can't do full text search on millions of documents with dynamic excerpts off a single computer in my living room.


This is wonderful and stupendous.

I’ve often thought that Google could be turned back into a good search engine by simply eliminating the crap and letting the useful sites float to the top of the results.

marginalia.nu seems to like my sites, so it must be good!

Some results are prefixed with ! or an arrow dingbat. What does that mean?


Nice! Typing my name in, gets my own site back as 3 of the top 5 results. I suddenly feel important ;)


Hi, it'd be nice if you could add an OpenSearch description document for your site.

https://developer.mozilla.org/en-US/docs/Web/OpenSearch


Until then I'll keep the site bookmarked. :-)
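
For reference, an OpenSearch description document is a small XML file. Below is a rough sketch that generates one, assuming the search URL pattern mentioned elsewhere in this thread (https://search.marginalia.nu/search?query=...) with OpenSearch's `{searchTerms}` placeholder. The short name and description strings are made up, and `opensearch_description` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

OPENSEARCH_NS = "http://a9.com/-/spec/opensearch/1.1/"


def opensearch_description(short_name, description, template):
    """Build a minimal OpenSearch 1.1 description document as a string."""
    root = ET.Element("OpenSearchDescription", {"xmlns": OPENSEARCH_NS})
    ET.SubElement(root, "ShortName").text = short_name
    ET.SubElement(root, "Description").text = description
    # {searchTerms} is replaced by the browser with the user's query.
    ET.SubElement(root, "Url", {"type": "text/html", "template": template})
    return ET.tostring(root, encoding="unicode")


# Assumed values, for illustration only.
xml_text = opensearch_description(
    "Marginalia",
    "Search engine favoring text-heavy sites",
    "https://search.marginalia.nu/search?query={searchTerms}",
)
```

The site would then advertise the file from its HTML head with a `<link rel="search" type="application/opensearchdescription+xml" ...>` element, which is what lets browsers offer an "add search engine" option.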


I searched for "giraffe evolution" (without quotes) and received the following links on the first page:

- Evolutionist scientists say the theory is unscientific and worthless

- Seven Mysteries of Evolution

- OTHER EVIDENCE AGAINST EVOLUTION

- Evolution Falsified

Not a single result about the evolution of giraffes...


Unfortunately, as you've discovered, giraffes are often used by crackpots to try to disprove evolution. Google seems to get around this by heavily boosting known authoritative sources like National Geographic and NIH. But, sadly, those are JS/image heavy sites.


This came back in a result set and I am thoroughly pleased.

http://www.aliens-everything-you-want-to-know.com/

This is the Information Superhighway in all her glory.


This is a fascinating tool, I estimated that the corpus of the factual web was between 1 and 10 TB when I last played around with BigQuery using domain names which had low amounts of click bait. Seeing these search results I suspect my estimate was off by a couple orders of magnitude.

Although a search for "Fractional Reserve Banking" shows that some further ranking improvements can be made to exclude unrelated results, and potentially penalize old conspiracy sites.

https://search.marginalia.nu/search?query=fractional+reserve...


I think I want a BBS. Text mode, fixed width font, keyboard-driven menus, no (or very little) bitmapped graphics. I've been thinking about the UIs for a lot of sites that I use to "do things" on the web. E.g. search for flights. Do I need any of that "beautiful" web design with pretty forms and fonts, bevelled edges, drop shadows, drop-down menus, hovers? Hell, do I even need a map? Heck no, I need three text entry fields and output a bulleted list, maybe table of results. Just give me the raw data and do as little presentation as possible, thanks.

I really think I want an internet console, not an animated magazine.


I wish we could configure Google's algorithm to our needs, and blacklist websites.


It could get tedious depending on how many sites you want to block, but you can add "-site:google.com" to exclude google.com, for instance.


I mean a blacklist system like Twitter's, where you block a website forever. Pinterest would be the first to go.


Ivermectin (marginalia): https://search.marginalia.nu/search?query=ivermectin+

Ivermectin (Google): https://www.google.com/search?q=ivermectin

The difference in the overall _thrust_ of the results is remarkable.

Very interesting! Thanks for building it.


Though before basing life-and-death decisions on this, consider reading the "about" page first: https://memex.marginalia.nu/projects/edge/about.gmi

> The purpose of the tool is primarily to help you find and navigate the strange parts of the internet. Where, for sure, you'll find crack-pots, communists, libertarians, anarchists, strange religious cults, snake oil peddlers, really strong opinions.

and

> If you are looking for fact, this is almost certainly the wrong tool.


Second result: "Ivermectin, a miracle drug against Covid Ivermectin, a miracle drug against Covid. 100% effective as preventative and for early stage Covid. Over 90% cut in fatality rate for late-stage cases.

https://truthsummit.info/blog/ivermectin-against-covid.html "

Eh I think I'll take the google search results on this one.


Yes, google returns results from the FDA, American Journal of Therapeutics pro Ivermectin study, WebMD, the CDC, the NIH, Wikipedia, the WHO, and New York Times.

Marginalia returns results from Wikipedia, a faculty member’s university blog regarding river blindness, a website called truthsummit promoting it as a miracle cure, a website called vaxxchoice promoting it as a cure, vitamindsstopcovid, etc.

I’d say the quality of the results is quite different.


The Google results tell you why Ivermectin is not a good replacement for vaccination against Covid, the Marginalia results tell you that Ivermectin is a miracle drug for treating Covid 19. Really shows how much technology has the power to change reality in today's world.


Part of what I wanted to show with this project is that there is no such thing as an objective search engine. Even seemingly irrelevant technological decisions drastically impact the narrative.


It's definitely not irrelevant. My first search was Covid related, because I knew the non-official, random-person-on-the-internet blog wouldn't have the money to create flashy sites.


Well the focus on text content isn't the only technical difference here. Google is obviously weighing hundreds of signals in its search results that your engine is not accounting for. These omitted signals are also relevant.


Google is actively fighting the spread of disinformation. You can see this clearly in the forced row of COVID PSA links on the YouTube front page that's been up for the last year regardless of whether you have any history viewing such content. There is manual intervention going on to prevent the garbage their normal algorithms will automatically surface. This is the greatest tragedy of the internet in that it allows people with crazy notions to find each other and build echo chambers with the aid of unbiased ML.


> These omitted signals are also relevant.

Certainly. And sometimes they're relevant in a good way, sometimes in a bad user-hostile way. Every search engine requires discrimination and intelligent usage by the person doing the search, just in different areas.


Right, but that is still a technical decision on their side. They presumably don't sit down and have a meeting about what world view they should present. Well I hope they don't.


Google links to the FDA, CDC, WHO, NIH, WebMD, drugs.com, and a pro ivermectin journal article from the American Journal of Therapeutics.

The Marginalia results point you mostly to random blogs.


Lol!


Question: how do we benchmark search engines? Are there any groups attempting to provide (open) solutions in this space?

(It seems to me that if you want to build a good search engine, this is the question you need to address first.)


The search term you might be looking for is "information retrieval". There are pretty standard measurements for whether you are getting good results, but they are generally conditioned on stuff like click-through rate, comparison to expert rankings, and other signals the user gives you that it was a good or bad set of search results.
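
Two of those standard measurements, precision@k and nDCG@k, are easy to sketch when you have expert relevance judgments. This is an illustrative toy: the document IDs and grades are made up, and the log2 discount is the common textbook form, not anything specific to a particular engine.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are judged relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def dcg_at_k(gains, k):
    """Discounted cumulative gain: later positions count for less."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked, judgments, k):
    """DCG of the ranking divided by the DCG of an ideal ranking."""
    gains = [judgments.get(doc, 0) for doc in ranked]
    ideal = sorted(judgments.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg else 0.0


# Hypothetical expert grades (higher is better) and an engine's ranking.
judgments = {"a": 3, "c": 2, "e": 1}
ranked = ["a", "b", "c", "d"]
p2 = precision_at_k(ranked, set(judgments), 2)
n3 = ndcg_at_k(ranked, judgments, 3)
```

Both numbers only mean something relative to the judgments you feed in, which is why the purpose of the engine matters so much for any benchmark.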


The benchmark clearly must depend on the purpose of the search engine, though. If the purpose is to find a thing you know is out there, but haven't been arsed to bookmark, then my search engine is hot garbage.

If the purpose is to provide links you will find interesting when you click on them, my search engine really actually has its moments. I guess it's almost closer to old reddit, before it was overtaken by astroturfing and bots.


Not sure how to contact you in case of possible bugs/problems, just going to drop a comment here. Was trying a bunch of "define:" queries and noticed something small;

This works https://search.marginalia.nu/search?query=define%3Ahallucina... This doesn't, or at least it's empty https://search.marginalia.nu/search?query=define%3Ahallucina...


Yeah it's probably not the most complete dictionary. I'm using Wiktionary-data. There might be the occasional gaps and vandalism.


Searched for “chocolate chip cookie recipe”

The first result had a recipe where I could see both the recipe and directions on a single page: no ads, no scrolling, no fake SEO anecdotes about kids and grandmas.

(Pls make the search query box fit small mobile devices)

Great project idea!


I've read most of the comments here and people are evaluating the search results: all good information.

I'm looking at "punishes modern web design"... This thing IS modern web design. I think it's called "marginalia" in reference to the huge margins they chose!

I'm using a browser on a Linux desktop and, side by side, HN's page design is old-fashioned and tasteful, making pretty good use of space, while marginalia has a font that's more than twice the 2D point size and is so spread out with whitespace that the "Tips" on the home page are off the bottom of my window.


how about a search engine that bans all pinterest content.

I hate pinterest with a passion. I may need to get a "his" laptop separate from my wife, since she needs that darn pinterest extension for pinning photos.


Quoted from the linked site:

> Convenience functions have been added, and the search engine can now perform simple calculations and unit conversions. Try 1 pint in cubic centimeters, or 50+sqrt(pi). This functionality is still under development, be patient if it doesn't work.

Why would you make any ever so small effort to implement calculations? I don't get it.

If your search engine enabled me to find more useful search results to my queries than google or yacy or whatever, I wouldn't care one tiny bit about being able to do calculations with it.

Why not focus on the search functionality?


I implemented calculations because easily 80% of my google queries are calculations, unit conversions, etc.

Search functionality is a larger priority. Calculations and unit conversions were an afternoon's break from the search functionality :-)


Ok, I guess people use google (or search engines in general) differently ... I rarely if ever use a search engine to calculate stuff or do unit conversions.

I use google only to search and when ecosia/bing doesn't return anything useful.


How else do you do unit conversions? I use Google because it's far easier than any other software I've tried, mainly because it's more forgiving of errors. It knows that "34 fset in msters" is 10.3632 meters. This search engine isn't as forgiving, though, so I wouldn't waste time trying to discover its unit conversion syntax rules.
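
Typo tolerance of that kind can be approximated with fuzzy string matching. A minimal sketch using Python's standard `difflib`, assuming a tiny hand-written unit table and a fixed "<amount> <unit> in <unit>" query shape. This says nothing about how Google or Marginalia actually do it; `resolve_unit` and `convert` are hypothetical names.

```python
import difflib

# Length units expressed in meters (small illustrative table).
METERS_PER_UNIT = {
    "feet": 0.3048,
    "meters": 1.0,
    "inches": 0.0254,
    "miles": 1609.344,
}


def resolve_unit(word):
    """Return the closest known unit name, or None if nothing is close."""
    matches = difflib.get_close_matches(
        word.lower(), list(METERS_PER_UNIT), n=1, cutoff=0.6
    )
    return matches[0] if matches else None


def convert(query):
    """Parse '<amount> <unit> in <unit>' and return the converted amount."""
    amount_text, source_word, _, target_word = query.split()
    source = resolve_unit(source_word)
    target = resolve_unit(target_word)
    if source is None or target is None:
        raise ValueError(f"unrecognized unit in {query!r}")
    return float(amount_text) * METERS_PER_UNIT[source] / METERS_PER_UNIT[target]


# The misspelled query from the comment above resolves to feet -> meters.
result = convert("34 fset in msters")
```

The cutoff is a trade-off: too low and unrelated words start matching units, too high and ordinary typos stop being forgiven.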


On macOS for example I'd use Spotlight.


You can also use the Chrome address bar without hitting Enter; just start typing.


I love it. Even though it didn't give me the results I was looking for. I searched "new york fishing license", and it didn't give me any links to the actual new york fishing license websites. But it did give me a ton of really cute little websites related to lakes and fishing in New York. This one has amazing information about fishing all over Western New York: http://www.huntfishnyoutdoors.com/fishing.php


Interesting approach.

I always search myself on new search engines to compare the results. Most engines return my personal blog/website, books/stories I've written, news stories, my github projects/contributions, social links, etc.

This search engine surfaces just three obscure IRC logs that contain my nick in join/part messages (nothing said from me!) from 2009. And nothing else.

There's probably some things this approach is really good at but I'm not sure what they'd be for me off hand. Always cool to see new approaches to search, though.


Wow, I'm floored by the quality of the search results.

For example, I'm a huge fan of obscure Brian Eno stories and interviews and articles.

Using the default blend for "Brian Eno", I found http://www.moredarkthanshark.org/eno_interviews.html which is truly a labor of love and the most comprehensive list of articles (with links to them, too) I've ever seen.

Not in a hundred years would I have found this using Google.

Thank you for building this!


I had an idea for an alternative search engine a few years ago that's a bit simpler to implement than this one. First we extract all the external links from Wikipedia dumps. Then we ingest only those sites into the index. The entire database comprises sites that have already been screened by Wikipedians. Gaming Wikipedia is generally more difficult than gaming Google or Bing. In theory, SERPs would be of a less commercial and more substantive nature than those we get from Google and Bing.
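
The first step, pulling external hosts out of article text, could look roughly like this. It is a sketch under loose assumptions: it scans raw wikitext with a regular expression rather than parsing a real dump format, and it treats anything not hosted on wikipedia.org as external. `external_hosts` is a hypothetical helper.

```python
import re
from urllib.parse import urlsplit

# Match http(s) URLs up to whitespace or common wikitext delimiters.
URL_PATTERN = re.compile(r"https?://[^\s\[\]|<>\"{}]+")


def external_hosts(wikitext):
    """Return the set of non-Wikipedia hostnames linked from the text."""
    hosts = set()
    for url in URL_PATTERN.findall(wikitext):
        host = urlsplit(url).hostname
        if host and not host.endswith("wikipedia.org"):
            hosts.add(host)
    return hosts


# Hypothetical snippet of wikitext using URLs seen in this thread.
sample = (
    "See [https://www.cs.rice.edu/~vardi/sigcse/ the report], "
    "<ref>https://solitaryroad.com</ref> and [[Internal link]] "
    "or https://en.wikipedia.org/wiki/Rome for background."
)
seed_hosts = external_hosts(sample)
```

The resulting host set would be the allow-list for the crawler, so nothing outside it ever reaches the index.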


There is an open standard way for an engine like this to provide a mechanism for your standards-aware browser to add the site as an alternative search with a click.

That way I would not have to remember or bookmark it; I could just use my search bar as normal and choose which engine to use for this query, or set it as the default.

https://developer.mozilla.org/en-US/docs/Web/OpenSearch


This is great, I like the results. Couple of things I noticed:

- Search results are often very old, from the early 2000s (I guess because back then more websites were text oriented). Are you taking into account the age of the page when showing results? It would be great to see more up-to-date results at the top

- I noticed a few results which directed me to websites with security risks, Firefox didn't even let me open them. Is it possible to filter these out from the results?


As a sufferer of Tinnitus, and having spent near 100 hours researching it, I found a few sites I had never seen offering great data and tools. Thank you


Honestly it's probably not a great source for medical advice. At least take what you read with a healthy grain of salt.


Based on a few searches it seems to favor sites with very long passages of text. Search for a name and you get pages with massive lists of names. It quite simply isn’t very good at everyday searches. But it does bring up the point, shouldn’t I be able to tell my search engine I want results like this? It should be a feature of google I can turn on and off. It should be one of many ways to impact relevance.


One use case to always test: "online wishlist" or "make a wishlist". If you start seeing tools like https://www.DreamList.com or others, you are on the right path. If you start seeing random web pages linking to individual wish lists, then people are likely not able to find tools on your search engine.


If the website is targeted towards an international audience, then it's nice to have the first-page links point to content in English. All four links on the main page https://www.marginalia.nu/ point to non-English content, which is not useful.

Disclaimer: I am not a native English speaker. English is my second language.


Yeah my main site is a bit of a disorderly mess. It started as a Swedish blog, but I've since added a few services aimed at a global audience. Haven't quite figured out how to unify it all just yet.


I tried two searches in Norwegian ("norsk ordbok" [norwegian dictionary] and "stortinget" [the parliament]), and they both returned many extreme or "alternative" websites. It was especially striking that the neo-nazi group Vigrid's website was the top hit for both searches. Maybe these sites just have less modern web design?


Yeah this is actually a bit of a concern of mine.

As very much a friend of Voltaire's, I don't think it's my place to police people's opinions no matter how disagreeable, but I also don't want my search engine to become branded as the search engine of choice for nazis because it's decent at cataloguing extremist sites.


I like the concept, but it did not work on any of the search phrases I entered consisting of the full title of a computer science article or book.

It also does not work for subjects. For example, if you search "discrete math" it links to academic webpages, but most of them do not have any notes posted. It is just a plain text website with the syllabus of a class.


Looking for an ARM assembly instruction, I instead get this strange website as the result: http://mailstar.net/coronavirus.html

Is that accidental or is this website promoted because it's text heavy and will surface for any search without many results?


Looks like that page just has an absurd amount of keywords. Those sometimes surface when there aren't any good results. Haven't found a foolproof detection method that doesn't unjustly punish innocent pages with large amounts of content.


Searched for my initials - got back a bunch of raw binary results (mp4, pdf, img, txz, etc.) which was disconcerting. Although it did find one reference to actual-me which is better than Google manages on the first 4 pages...

https://imgur.com/a/n2xro2Y


Yeah, there was unfortunately a problem with the content-type code recently: it categorized some binary data as HTML and tried to process it best-effort. So there's some binary soup in the index.

The bug has since been fixed, but the fix won't come into effect for a few weeks.


Instead of looking for something specific, I decided to try a category of some sort to see what came up. Thinking about Jeopardy categories, I tried "potent potables" and found a lot of random pages that may or may not have made sense given that category but that I had a lot of fun reading. Definitely a win for me.


The website itself seems generated with some kind of kick-ass generator from template files (.gmi?)

I feel like I'm stuck with Wordpress.com because it brings me some traffic (whereas something hand-rolled on nsfspeech or digital ocean or whatever would literally be off the edge of the web), but the structure of that is so cool.


You can easily do proper SEO with static site generators too. What's more, static sites can be hosted via GitHub or GitLab Pages, Netlify or CloudFlare, and in almost all cases the speed will outperform Wordpress. Also, you have way more control over the output than with Wordpress.


That would be gemini protocol!


This is brilliant - I can actually surf the web for fun again. This engine's actually a nice complement to another mainstream engine, as the regular one is good for searches during the working day, whilst "marginalia" (?) is great for recreational reading, and actual learning.


This is really cool! So retro!

Here is the second result when you search for “cat food”. It takes you to some old dude's entire family tree with full history and biographies… it even uses subdomains and everything! Crazy!

http://www.torrens.org/


> This search engine isn't particularly well equipped to answering queries posed like questions, instead try to imagine some text that might appear in the website you are looking for, and search for that.

Heh, I guess I'm getting old but I remember when this was the only way to search the web


> Don't be afraid to scroll down in the search results

I never knew it was fear that was preventing me from scrolling


Love it! Searched for Shigeru Miyamoto and this was the third result: https://www.glitterberri.com/developer-interviews/miyamoto-h...


Gave it a go with two different queries. The first I chose was “amazon vendor services”; I didn’t get a single result about the topic.

The second query was a nation + city (in the nation). Got a lot of results that were in no way related to either.

It seems to be biased towards IT topics (based uniquely on the two queries).


One nitpick that kind of bothered me - on a large desktop monitor, the results page was like 70% whitespace margins with the results squished in the middle like a portrait cellphone. Hopefully it's easy to fix, I like to research at home and this website could help a lot!


All of my searches are turning up unrelated results ("college life after the pandemic", "post-pandemic teaching in higher education", "football news NFL" etc.)

NFL one had 'some' decently related results, but the websites were all strangely disreputable.


> the websites were all strangely disreputable

Interesting you'd feel that way when sites without "modern design" are encountered. Is this your own bias perhaps creating a judgment or are they sites that you already know have a bad reputation?


Or perhaps the websites being returned are garbage? I have the same experience trying a few searches and following the top 5 links. Besides wikipedia, I haven't found a single useful website.


This!

Maybe modern or 'non-modern' web design just isn't a great litmus test for quality content? Could just need some work. At any rate I wasn't clicking on the results.


from About page:

> If you search for "Plato", you might for example end up at the Canterbury Tales. Go looking for the Canterbury Tales, and you may stumble upon Neil Gaiman's blog.

I know it is just a suggestion, but had to try searching both, with no luck in getting the expected unexpected.


Yeah I did some work very recently aimed at improving the relevance a bit. It was a bit too random in the state it was before. Now it, perhaps, isn't random enough anymore.


It looks very nice anyway, great job! I did try with other queries and results were in general interesting.


Very cool idea! Room for lots of improvement, keep working on it, I like the direction this is going.


I just got it working reasonably well just this week. I've had it "working" for a few months, but the results were always extremely chaotic, bordering on random.


It's excellent, I looked up some physics topics and got some excellent results - real meaty stuff full of text, equations and applicable diagrams, etc.

I've not only bookmarked it but also I've an icon linked to it on the taskbar. Will watch its progress with interest.


Congrats for the effort, I really like the idea and it works wonderfully for some searches.

However, I searched "infiniband" and the results are far away from what I would expect or like to see. Most of the results that appear first are completely unrelated to the topic.


Really impressed with the results I’m seeing so far. In all searches I have done so far, the results are truly lightweight, and haven’t had to click through any modals, subscription pop-ups or any other junk thus far! Will be using more in the days to come.


Is there any way to add this as a favored search engine in the browser?

I currently use Google as it's set as the default search when I type in the address bar, but would love to switch and move Google/DDG to an added character like "<search terms> @g"


All the major browsers support adding custom search engines. You just need to specify the URL template to do the search. The common format is to put "%s" as the search term. You can use it for any site, not just things that are considered search engines.

Firefox is a bit different, since you do it by adding a bookmark, and giving that bookmark a keyword. The other browsers I checked have an option under the search engine settings.

After defining the custom search engine, you just type "<keyword> <search term>" in the URL bar.
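As a minimal sketch of what the browser does with such a template, the snippet below substitutes URL-encoded terms for "%s". The template is inferred from the search URLs posted in this thread; the keyword "m" and both function names are made up for the example.

```python
from typing import Dict, Optional
from urllib.parse import quote_plus

# Template inferred from search URLs shared in this thread.
MARGINALIA = "https://search.marginalia.nu/search?query=%s"

def expand_search_template(template: str, terms: str) -> str:
    """Replace the "%s" placeholder with the URL-encoded search terms."""
    return template.replace("%s", quote_plus(terms))

def resolve_keyword_search(typed: str, engines: Dict[str, str]) -> Optional[str]:
    """Treat the first word as a keyword and the rest as the search terms."""
    keyword, _, terms = typed.partition(" ")
    template = engines.get(keyword)
    if template is None or not terms:
        return None  # not a keyword search; the browser would fall back to its default
    return expand_search_template(template, terms)

url = resolve_keyword_search("m Ingolf Arnarson", {"m": MARGINALIA})
```

Here `url` comes out as the same address you would get by searching for the phrase on the site directly.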


Ah I'm on FF so I'll have to do the weird bookmark method. A little annoying since it supports other search engines.


Is there a way to set marginalia as default search in firefox?


I read from other comments that you're writing your crawler bot yourself. Instead of crawling "from scratch", have you considered using an existing DB like Commoncrawl? Or is there something else that you index not present in Commoncrawl?


Cool idea but it needs to be able to handle special characters. Right now searching "Hello World c#" returns no results because the search term can't handle #. I also can't just delete the # because then I would be stuck writing C...


An interesting concept and awesome work!

I searched for high pressure air (HPA) regulator trying to find a description of how one works. I didn't find that, but did find some interesting links on how they're used in scuba, and one guy's homemade gas laser.


If you like wacky search engines, there's also Million Short: https://millionshort.com where you can search and remove the top 100/1K/10K/100K/1M results.


A big fan of your work! Just wanted to let you know that iOS devices provide quotes as “ rather than " - you may need to support the character “ or at least let people know that iOS is not supported etc… right now I get a generic character error.
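One possible way to handle this, sketched under the assumption that queries can be normalized before parsing: map the typographic quotes that iOS inserts to their plain ASCII equivalents. The function name is hypothetical and says nothing about how the engine is actually written.

```python
# Map typographic ("smart") quotes to plain ASCII quotes.
SMART_QUOTES = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})

def normalize_quotes(query: str) -> str:
    """Return the query with smart quotes replaced by ASCII quotes."""
    return query.translate(SMART_QUOTES)

cleaned = normalize_quotes("\u201cbrian eno\u201d interviews")
```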


That's very good to know. Thanks!


Tested with the first person to settle in Iceland: https://search.marginalia.nu/search?query=Ingolf+Arnarson

and it worked surprisingly well.

Does anyone else have good examples?


> New: You can now look up dictionary definitions for words. If you for example don't know what the definition of is is, you can inquire thus: define:is.

Oh man, I love subtle jabs and tongue in cheek writing like this. Very Robin Williams-esque.


I am the first to admit it's a pretty dated reference.


Can we submit text-heavy sites for possible inclusion? Assuming they pass your filters.


Fantastic project! Found very interesting links for a lot of compiler-related keywords. A similar service that I found useful, though different in its approach to cutting through e-commerce and SEO-optimized websites, is Million Short[0].

[0] millionshort.com


Great for a text-focused site- however, the results are a bit confusing. Would help if there were more details on the criteria used for a site to be included in the index.

Suggestion - Use system fonts (the site downloads almost 300k of fonts)


Not sure what to do with this.

https://search.marginalia.nu/search?query=gan+charger

aside from nexperia none of this looks even remotely relevant.


I'm all for nostalgia but IMO the information should be the most important thing. Presentation is a close second though, and I can kind of get behind this project when I see what the web is like without an ad-blocker.


Cool!

There do seem to be some text encoding issues though. For example: https://search.marginalia.nu/search?query=tim+visee


Yeah I think the charset detection needs work.

It understands the "Content-type: text/html;charset=utf-8" -header, and <meta charset="UTF-8">

but not

<meta http-equiv="content-type" content="text/html; charset=utf-8">

It turns out HTML has a lot of corner cases. I'm constantly marveling at how web browsers hold together as well as they do.
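To make the three cases concrete, here is a small sketch that checks the HTTP header first and then both `<meta>` forms. It uses only the Python standard library; the names `detect_charset` and `MetaCharsetParser` are invented for this example and are not taken from the search engine's code.

```python
import re
from html.parser import HTMLParser
from typing import Optional

CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)

def charset_from_content_type(value: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type style value."""
    match = CHARSET_RE.search(value)
    return match.group(1).lower() if match else None

class MetaCharsetParser(HTMLParser):
    """Records the first charset declared in a <meta> tag."""
    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("charset"):
            # <meta charset="UTF-8">
            self.charset = attributes["charset"].strip().lower()
        elif attributes.get("http-equiv", "").lower() == "content-type":
            # <meta http-equiv="content-type" content="text/html; charset=utf-8">
            self.charset = charset_from_content_type(attributes.get("content", ""))

def detect_charset(header: Optional[str], body: bytes) -> Optional[str]:
    """Prefer the HTTP header, then fall back to <meta> declarations."""
    if header:
        found = charset_from_content_type(header)
        if found:
            return found
    parser = MetaCharsetParser()
    # Only the start of the document is scanned; 4096 bytes is an arbitrary cutoff.
    parser.feed(body[:4096].decode("ascii", errors="replace"))
    return parser.charset
```

When all three checks come up empty the function returns `None`, and the caller still has to pick a fallback encoding.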


Thanks for your response! Hope you can implement this as well without too much trouble.

I wonder if you could just assume UTF-8 to be the default these days. I imagine that to fix many other cases as well.


Haha! I did actually assume UTF-8 at first, but since the search engine has a lot of older websites, I sadly got a lot of encoding errors doing that, too.


Maybe, just as with JS-heavy sites, also punish sites with non-conforming encodings.


Wow. Love this.

Searched for “Ramanujan”, one of my heroes.

Found this gem- https://math.ucr.edu/home/baez/ramanujan/

Ramanujan’s “easiest” formula.

Awesome!!


Seems like this is still very very hard! I searched for "hart protocol" hoping to find this: http://www.romilly.co.uk/


I adore this. Unfortunately, searching for my own name - with or without quotes - doesn't actually find my site.

It does find a handful of references to me from over twenty years ago, though, which I thought was fascinating.


My name retrieved the "dead pornstar list". Unexpected.


This is the most amazing thing I have seen on here in at least a year!

It's... no... it can't be... a search engine that finds actual information instead of 5 megabyte blobs of tracking code and SEO crap!


Search for playwright waitForSelector and you land on a pretty useless page. I'm all in for text websites, but something like the playwright.dev documentation is top notch - fuzzy search being the key thing.


Yeah I wasn't really planning for this to blow up like it did today. It's currently sitting at about 35% of the index size I usually aim for, so besides the stuff I can't index because it's behind CDNs, there's a lot of pages it just hasn't gotten to yet. playwright.dev is pretty low on the priority list because it has a metric crap-ton of javascript on its front page. The crawler has visited it, looked at it, and put it very far down the priority queue.


Even though some sites have a metric crap-ton of JS, they sometimes render very minimally for certain screen sizes or mobile devices without any of the JS crap. Does your crawler pay attention to any of that?


It doesn't look at what the javascript does, just how much there is.
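A toy illustration of measuring "how much" without looking at behavior: count external script tags and the size of inline scripts, then sort the crawl queue by that number. The weighting is entirely made up, and this is only a guess at the general idea, not the crawler's actual scoring.

```python
from html.parser import HTMLParser

class ScriptCounter(HTMLParser):
    """Counts external <script src=...> tags and inline script characters."""
    def __init__(self):
        super().__init__()
        self.external_scripts = 0
        self.inline_script_chars = 0
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            if any(name == "src" for name, _ in attrs):
                self.external_scripts += 1

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.inline_script_chars += len(data)

def script_weight(html: str) -> int:
    """Higher means more JavaScript; the scripts themselves are never executed."""
    counter = ScriptCounter()
    counter.feed(html)
    counter.close()
    # Hypothetical weighting: a flat cost per external script plus inline size.
    return counter.external_scripts * 1000 + counter.inline_script_chars

pages = {
    "plain": "<p>Long article text</p>",
    "heavy": '<script src="app.js"></script><script>var x = 1;</script>',
}
# Pages with less JavaScript sort to the front of the crawl queue.
crawl_order = sorted(pages, key=lambda name: script_weight(pages[name]))
```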


This would be awesome if the search actually worked. Typed in 'runescape' and expected a few websites left over from the early 2000s. But I got nothing, just a lot of hits to other keywords.


Huh. That's an interesting case you've found.

I think part of the problem is gold sellers were displacing all the good results. I blocked a few of them and got a few more relevant results, but it's not great.

But still, the search engine only finds two great hits. I wonder why not. Maybe there just aren't that many runescape pages around still? Or it may just be that it hasn't found anything better yet. The index is pretty shallow right now, only 20M URLs, I aim for more than double.

I honestly wasn't planning on this blowing up on HN at this stage.


Searching for your own name will turn up some interesting results! I got some early 90s webpages that just contain obituaries or marriage records. I never knew cities maintained these records online!


Is it fair to assume that text-heavy sites that are inactive (but still online) don't have SSL?

If so, would you ever tweak the parameters to surface sites that aren't served with "HTTPS"?


Love it. You should provide a link to Patreon / whatever so people can support you financially. Hosting is probably not cheap for you. Given the love here on HN I suspect you'd do well.


Hosting is actually surprisingly cheap, but that's because I'm hosting it on consumer hardware in my living room, off my domestic broadband connection.

That's both a blessing and a curse. It works okay as long as I don't touch it, but I can't do maintenance without shutting it down. I can't implement crawler changes without a week of shitty results as it needs to visit half the Internet before it gets decently good. I can only afford a production machine, so all testing that can't be done with unit tests gets done there.

Anyway I added a patreon in case anyone wants to toss a coin.


I don't even want to imagine how google and other search engines crawl websites that make heavy usage of react or other ajax stuff. I don't want to be that guy.

I wonder if some browser engineers are trying to have some ideas on how to find a solution on this. Personally, I would just make a browser that breaks backward compatibility, remove old features, etc. I guess browsers would be much lighter, fast and simple if some hard choices were made.

Mozilla already decided to break some websites with the strict cookie policy. I wish they would do the same for everything else that sucks on the modern web.

I honestly don't think I have much respect for "web developers". In a way I want mobile apps to kill the modern web, just to prove a point.


I just looked up my last name and found a World class heavyweight weightlifter named Josef Grafl born in 1872 who has an awesome portrait of him on Wikipedia. Never before have I read about that man.

I love this.


I don't think this is a good idea: when I'm searching on the web I want to get results with a high relevance to my search query. I don't care a lot about the presentation.


I searched "c strtok" and got one result saying '"strtok" could be spelled "stroke", "stork", "sarto", "strop"'.

Cool concept though!


The spelling suggestions are presented whenever there aren't any results, but sometimes they can be pretty misleading.

What happens is that C, as a word, isn't indexed because it's deemed too short, and the bigram "c strtok" can't be found anywhere.

Try 'strtok' instead.
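The failure mode can be reproduced with a toy index. Everything here is an assumption made for illustration: the length threshold, the whitespace tokenizer, and the matching rule are all invented, and the real index is certainly more involved.

```python
MIN_WORD_LENGTH = 2  # hypothetical cutoff; single-letter words are not indexed

def index_terms(text: str) -> set:
    """Index single words above the cutoff plus all adjacent word pairs."""
    words = text.lower().split()
    terms = {word for word in words if len(word) >= MIN_WORD_LENGTH}
    terms.update(" ".join(pair) for pair in zip(words, words[1:]))
    return terms

def matches(query: str, document_terms: set) -> bool:
    """A two-word query may match as a bigram; otherwise every word must be indexed."""
    words = query.lower().split()
    if len(words) == 2 and " ".join(words) in document_terms:
        return True
    indexable = [word for word in words if len(word) >= MIN_WORD_LENGTH]
    return len(indexable) == len(words) and all(w in document_terms for w in indexable)

doc = index_terms("the strtok function in c splits strings")
bigram_query_found = matches("c strtok", doc)   # "c" is too short and the pair never occurs
single_word_found = matches("strtok", doc)
```

In this toy version the document clearly mentions both words, yet "c strtok" finds nothing while "strtok" alone does.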


Does it filter out ad-heavy copy-paste/autogenerated fake sites? Tired of seeing those on the first few pages of Google. Bing gets more and more usable, but far from perfect.


It tries.


Love this! Is there any way someone could help contribute to this?


There's probably a more suitable term than "modern" that we should generally be using, since "modern" consistently has a positive connotation.


Dunno, I prefer to use neutral or positive terminology even when I talk about things I don't like. I think it very easily comes off as juvenile ranting when you start throwing around terms with strong negative connotations.


How does this have 2.5K upvotes when every single HN related project needs JS and a quad core CPU (for the browser to open a blank page) to view a paragraph of text?


Related question - suppose I want to create a meta search engine for myself, and I want it to be as fast as possible. What are the things I should be optimizing for?


I would like a search that punishes 'modern' SPAs that load 87mb of the author's pet JS projects to display simple text. Basically every modern SPA.


I tried a few queries and got extremely irrelevant results


It really depends on what you search for. A major drawback is that there need to be text-heavy sites to find, in order for the search engine to find them.

Compare for example the results for "Duke Nukem 3D" with those for "Cyberpunk 2077".


Have you given any thought on what you will do if you get a DMCA take down request or a request from a person asking you to remove them from search results?


This is soooo good. I'm finally finding sites I haven't heard of with good content.

I didn't realize how much I missed this stuff.

The popular web has become so bad nowadays.


How do you submit a site? I searched for "A search engine that favors text-heavy sites and punishes modern web design" but nothing was found.



It is very much a work in progress, still struggling with some areas. I only really got into the territory of "sometimes actually useful" like this weekend. Wasn't planning on blowing up on HN just yet.


Let's make the internet a great place for knowledge again. I really loved the engine and tried it with different terms, and I'm very happy. Good job


I like the concept of a search engine that does not try to figure out what I should learn based on what I search. I know what I'm searching for.


I like the idea but could use some tweaking. I keep getting conservative christian websites for some reason. And foreign language sites


Damn, that's an interesting search engine. It's great for searching simple terms and finding a bunch of blog articles about the term.


I like the idea. However, the results take up too much space vertically; it's slow and cumbersome to scan through them.

I think it would benefit from using a responsive layout: allow the text to expand to a wide 1000+ px and make the font smaller, so the excerpt can fit in one or two lines below the links.

Google has problems but their search results layout is easy to scan.

Otherwise I genuinely wish I would use it, because the Google search's "self referential reality bubble" is really annoying.


So is this a filter on top of Google or is it search from scratch? Would love to understand more of the implementation.


It's built from scratch. I'm doing the crawling and indexing. Look through my comments and you'll find a few outlines of the stack and the index design.

Here are my blog entries relating to this:

https://memex.marginalia.nu/topic/astrolabe.gmi


This isn't meant to be pressurizing or to sound like a demand (if it seems that way), but have you thought about uploading the source code for your search engine?

Something like this has the potential to be used in university courses to teach how to build a search engine and/or teach 'advanced' programming concepts and ideas. It's a real program showing what you need to do to optimize your database and software to work on consumer desktops (even if your specs are higher than what other people would have; 128 GiB for example is quite a lot of RAM for most consumers) and how to handle malicious data that you will come across (for example, link farms).

In addition, I read all the posts on your site that were listed in the page you linked to, and to me those posts would actually seem more useful as an explanation of the code that people can view together side-by-side, rather than as the only way people can know how you implemented your algorithms and search engine. I guess what I'm saying is, having an explanation in words of the algorithms and code along with the actual code can be a very powerful combination for teaching and learning.

Thus, again, would it be alright if you upload a copy of the source code for people (including myself) to look at? I personally don't care about if it's released under an open-source license (or not), or if you just add a zip file on your site vs making a repository on Github, or even if you never update the code you release. I (and most likely others) want to peek at at least one version of what you wrote to see how something like this works under-the-hood, which again, I'm asking if that's alright with you.

Also, I'm not asking you to share the database(s) you have for this, especially since they're giant and would likely take up more traffic downloading from your site than anything the search engine can do.


I'm thinking I may open source some of the components I use, rather than publishing the whole thing, as it's part of a larger monorepo that contains the somewhat integrated code for a large set of services, public and private.

None of the code is particularly fancy, just highly specialized. I built the components myself mostly because I couldn't find anything available that met the rather special demands that are put on the application.


That's understandable. I just wasn't sure if anybody mentioned what I said by now, or if you were already thinking about this, so I at least wanted to get the thoughts in your mind before HN closes all comments.

Also, if I'm not being too rude, your posts were a little hard for me to understand, so I thought that releasing some or all of the code would help people (and myself) understand them better, ignoring the teaching stuff that I already talked about. With code, people can debug and change it to show what the program's doing. It'd also at least satisfy some people's curiosity (and probably use up their weekend).

And thanks for replying!


Just searching for 'dogs' gave me more interesting results than I've seen from google in years


"Search results Search "alt.sysadmin.recovery" needs to be a word Those were all the results,"

No comment.


I don't really care about website's design, as long as it gets out of the way of me reading it.


Designed for serendipity indeed. Tried a few searches, results are quite fun, but none of them relevant.


Awesome work! I had a similar idea in mind but I'm glad to see someone else was able to pull it off.


Curious: how do you afford the infrastructure? I found that the hardest part of running a search engine.


I'm self-hosting, and the server is a Ryzen 9 3900X with 128 GB of non-ECC RAM. It sits in my living room next to a cheap UPS. I did snag one of the last remaining Optane 900Ps off Amazon, and it powers the index and the database--and I really do think this is among the best hardware choices for this use case. But beyond that it's really nothing special, hardware-wise. Like it's less than a month's salary.

It runs Debian, and all the services run bare metal with zero containerization.

Modern consumer hardware can be absurdly powerful if you let it.

Like I have no doubt a thousand engineers could spend a hundred times as much time building a search engine that did pretty much the same thing mine does, and it would require a full data center to stay running and be much slower. But that's just a cost of large scale software development I don't have to pay as a solo developer with no deadline, no planning and a shoestring budget.


Wow, if this catches on, my original content will actually matter![1] I've always had a love-hate relationship with modern web design principles because my design choices have all the excitement and polish of what we get on HN.

I'm sure I'm not the only one, either. Content-rich sites need more love.

[1] https://adequate.life


Does this also penalize pages with tons of ads and three paragraphs of text? Or anything from Medium?


Love it! I can punish my employees by setting this as a default search engine on their work laptops.


Beware, I got the impression straight away that some sites were censored from the results for no good reason.

For example, if you search "jehovahs witnesses", all pages from jw.org are missing.

Exactly the same thing happened when I searched "mormons" - the official website is missing and it only brings up sects/hate/conspiracies against mormons.


If you want mormons in positive light, you should search for "latter-day saints", as that's how they typically brand themselves.

jw.org is on the indexing list, just pretty far down, based on the fact that previous times it's been visited it's had a ton of javascript.

I don't have an axe to grind with fringe religious movements, I actually love them to bits. Try searching for Nag Hammadi or Hermes Trismegistos.


jw.org appears to be the kind of modern web design this is trying to avoid. I seriously doubt it has anything to do with the cult.


No Cyrillic or hiragana support :-(


Oh, this is brilliant! I think I'll make this my "first stop" search engine.


I predict it will return a disproportionate amount of sites by schizophrenic conspiracists.


"Don't be afraid to scroll down in the search results, unlike in many other search engines, depending on what you are looking for, you may find the best results in the middle of the listing."

This is a very polite way of saying "this engine isn't very good"

Overall impressed with the project but I thought the word play there was funny


I felt I needed to add it to help people taught by other search engines that they only get 1-2 good results, and the rest is useless. The reason I'm providing a hundred results is that there are often a lot of results to choose from. If the point is to find something unexpected, and that indeed is the entire point, then that is the only sane design choice.

Like you search for something on Google and similar, and you know what you are going to find. They are so good at searching the Internet and predicting what you are going to click on that you never see something new.

It's a great feat of engineering, but a huge tragedy, because discovering new things, outside of what you or your demographic has previously demonstrated an interest in, can be absolutely life changing.


"corporate speak" bs detector and filter on google search engine would be nice.


hmm, I dream of recipes search engine that punishes recipes pages with too much text. lol


Yeah, recipes sites have both too much text and too many pictures.

But they do illustrate what this search engine needs to watch out for. If they rank more text higher and their search site becomes popular, won’t everyone just spam recipe-site word salad, maybe even AI-generated word salad?

But in the interval, until that day comes, they are going to have a very useful service.


Too bad it rejects non-Latin words, as if the definition of "text" is a sequence of alphabetical letters originated from Latin.

I thought that we've reached the time to embrace all cultures in the world, but this retrogressive engine proves that most modern tech designers are myopic about other civilizations in the globe.


Understand that this is something I built for myself, by myself, so it focuses on languages I understand. It's hosted on a single consumer grade computer in my living room. I built it out of pocket and anyone is free to use it. Does this make me a villain?

If I can do this, what's preventing some guy in Japan or India or Peru from doing the same, of course focusing on their languages?


Maybe a better suited choice for errors than an insulting message: when I provided a query in my native language it regurgitated the error "needs to be a word" instead of more acceptable "not a supported language".

When you claim that a word in some other culture is not "a word", just because it's not recognized by your machine, that's demeaning to say the least.


Again it's a one man hobby project, I don't have a team of people to go through every formulation and every error message to ensure nobody can read them in a way that offends them. It's just me, writing code on an unfinished project that HN discovered.

In this case, the code doesn't match the word regexp, like it may be a @TwitterHandle or a "comp.lang.c" with periods in it, or an unsupported Unicode range. It doesn't know why it is not matching, just that it doesn't.
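
A rough sketch of that behaviour, not the engine's actual code: the pattern and the name `is_word` below are assumptions made up for illustration. The point is only that a regexp match is a yes/no answer, so the three very different inputs all fail in exactly the same way.

```python
import re

# Hypothetical word pattern: ASCII letters and digits plus accented Latin
# letters (Latin-1 Supplement and Latin Extended-A). The real pattern used
# by the search engine is not shown in this thread.
WORD_RE = re.compile(r"[A-Za-z0-9\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u017F]+")


def is_word(term: str) -> bool:
    """Return True only if the whole term matches the word pattern."""
    return WORD_RE.fullmatch(term) is not None


# A handle, a dotted name and an unsupported Unicode range are all rejected,
# and the boolean result carries no hint about which case it was.
for term in ["empire", "@TwitterHandle", "comp.lang.c", "\U00013000"]:
    print(repr(term), is_word(term))
```

Telling those cases apart would need extra checks on top of the match, which is why a generic error message is the easy default.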


I must congratulate you on this achievement. That's certainly a useful take on search.

Nonetheless, even when coding, one should also consider thoroughly the UX and how it would be addressing the others.

Saying "unsupported word" is much more sympathetic than "needs to be a word" (where you define what a "word" is, and the general user is unaware of such definition).


Fair point, I refined the phrasing a bit.

> The term "𓀀" contains characters that are not currently supported


Stop feeding the trolls, great job on this project and keep it up, hope at least most of HN is more empathetic.


No, it just proves that a one-man hobby project with finite resources found it reasonable to restrict the scope.

Maybe when they find out they're an immortal billionaire they can build all the additional things you've entitled yourself to expect from the freely shared work of others.


It’s one guy. Making a useful tool. It even has an altruistic purpose.

Shame on you for twisting a well intended effort into a negative statement that suits your narrow identity political world view.


Less insulting error messages would be more welcome than casting out others without any consideration.

You may distribute "shame" however you want, but this only helps enforcing the damaging insults and amplifying them.


Let a thousand search engines bloom.

btw, interesting how many http (as opposed to https) sites show up...


Not to overemphasize meta commentary, but damn 3200 points, 650 comments in 2 days -- this is one of the highest rated posts I can remember. Seems HN readers are very interested in alternatives to the current search hegemony and the kind of low-quality junk articles that litter it.


I'm still a bit stunned at the reception this has gotten.

I wasn't even planning to launch this, not like this, there's so much that needs to be fixed to get it working well, so much jank and so many weird limitations. It's quietly been online for a few months. But it's not really worked all that well except in rare cases. Then I fixed a few issues and implemented some improvements, and it was really just last weekend that I was struck by the sense that it actually was coming together into something really viable, and then... this happened.

I've gotten so many positive comments about this project, a large number of emails and I haven't had the time to get back to half the people who wrote, even people donating money to support it.

I half thought I was becoming the next TempleOS-guy, hacking away at some madcap scheme all by myself. I just had no idea this resonated with so many people.

It's incredibly encouraging and motivating. Thanks everyone!


This is great. The results for my search were like a suggested reading list.


Amazing! How do I make this my search engine on browser? Not home page.


A little bit harsh "punishes". It's a cool search engine.


Super! How far along would you say you are in indexing the blogosphere? I tried the engine a few times, but I mostly get academic papers, and I know most (good) blogs are in fact text-heavy.


It's not indexed particularly deeply. Blogs typically have a decent amount of javascripts.


I have a static website (granted, it's not well linked), with no JS, and it does not seem to be in the index. But I saw in a sibling post that you had index limits, so it makes sense IMO.


Probably just hasn't been discovered then. The Internet is big. I wish more small websites would be better at linking to other small websites :-(


It must be close, I found a page that links to it. Only one or two more orders of magnitude to index it :-)


I'm developing a text-heavy site and philosophically I'm trying to view documents as just that... documents [1].

But I don't get good results for "rug pull".

- 1 https://rugpullindex.com


Yeah it's hosted by cloudflare. I'm currently IP-blocking them, because they keep prompting my crawler with a captcha, presumably because it's made millions of requests from their CDN.

It's some rigmarole getting recognized as a good bot by the CDNs. I've submitted a request fairly recently, but haven't heard back from them yet.

Like I would like to be on good terms with them, and other websites that block small independent crawlers.

I can't blame them though, there's a lot of bad bots out there. But I'm doing my best not to be part of the problem.
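
For anyone curious what IP-blocking looks like on the crawler side, here is a minimal sketch under stated assumptions: the CIDR ranges are reserved documentation networks standing in for a CDN's published ranges, and `BLOCKED_NETWORKS` and `is_blocked` are hypothetical names, not anything from the actual crawler.

```python
import ipaddress

# Placeholder ranges (reserved documentation networks), NOT a real CDN's
# address space. A real crawler would load the provider's published list.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]


def is_blocked(ip: str) -> bool:
    """Return True if the resolved address falls inside a blocked range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_NETWORKS)


# The crawler would resolve a host first, then skip it if this returns True.
print(is_blocked("192.0.2.17"))
print(is_blocked("203.0.113.5"))
```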


Aha, I was going to ask how you were coping with CDNs like Cloudflare blocking bots. It's sad we've got to this point where basically only the established search engines are grandfathered in to be able to crawl sites.


I wonder if google had to plead the same way or if already-big players are given a free advantage.


> I'm developing a text-heavy site

I looked at the source for your site's front page. That's not text-heavy; that's markup-heavy. I didn't bother looking at the rest of the pages because it appears to be yet another crypto market site.
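
One crude way to put a number on "text-heavy" versus "markup-heavy" is the share of a page's bytes that end up as visible text. This is only a sketch of that idea using the standard library; `text_ratio` is a made-up helper, and nothing here is claimed to be how the search engine actually scores pages.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def text_ratio(html: str) -> float:
    """Non-whitespace visible characters divided by total document length."""
    if not html:
        return 0.0
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    visible = "".join("".join(parser.chunks).split())
    return len(visible) / len(html)


prose = "<p>" + "word " * 50 + "</p>"
nested = ('<div class="a"><div class="b"><span style="x">hi</span></div></div>'
          "<script>var x = 1;</script>")
print(text_ratio(prose) > text_ratio(nested))
```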


I tried “Error 49” as a search phrase.

It’s rudimentary but no IT-related result.


This is awesome! We should definitely move in this direction.


This is really good; I'll actually use it!


Saving this forever. Thank you for making it.


lol this is great, reminds me of the old school search engines we would use in school back in the day before Google haha.


It says it punishes modern web design but it has my most irritating feature of modern web design: a narrow strip of text on an otherwise blank page.


modern design = low information density?


Definitely low signal to noise. Looking at you recipe websites and cooking blogs.