For a simple test, I searched "fall of the roman empire". In your search engine, I got wikipedia, followed by academic talks, chapters of books, and long-form blogs. All extremely useful resources.
When I search on google, I get wikipedia, followed by a listicle "8 Reasons Why Rome Fell", then the imdb page for a movie by the same name, and then two Amazon book links, which are totally useless.
But no, these stories all come from cookie-cutter "new media" blog sites, written by an anonymous content writer who's repackaged Wikipedia/Discogs info into Buzzfeed-style copy writing designed to get people to "share to Twitter/FB". No passion, no expertise. Just eyeballs at any cost.
It reminds me of an annoyance I have with the Kindle store. If I wanted to find a book on, let's say, Psychology, there is no option to find the all-time respected books of past centuries. Amazon's algorithms constantly push to recommend the latest hot book of the year. But I don't want that. A year is not enough time for society to determine whether the material withstands time. I want something that has stood the test of time and is recommended by reputable institutions.
The problem is that clickbait and low-effort articles can be good enough to get the click, but low-effort enough to drag society into the gutter. As time passes, the system is gamed more and more, optimizing for the least effort for the most clicks.
They do? That would explain a lot - but ironically, I can't find a good source on this. Do you have one at hand?
I think they take different things into account based on the thing being searched.
Engagement, as measured in clicks and time spent on page, plays a big part.
But you're right, to a degree, as frequently updated pages can rank higher in many areas. A newly published page has been recently updated.
A lot depends on the (algorithmically perceived) topic too. Where news is concerned, you're completely right, algos are always going to favor newer content unless your search terms specify otherwise.
PageRank, in its original form, is long dead. Inbound-link signals are much more complex and contextual now, and other types of signals get more weight.
The only definitive source on this would be the gatekeeper itself. But Google never says anything explicitly, because they don't want people gaming search rankings. Even though it happens anyway.
It's all stamped with Google Ads, of course, and then Google ranks these pages high enough to rake in eyeballs and ad dollars.
Also there's the fact that each year, the average webpage picks up two more video elements / ad players, one or two more ad overlays, a cookie banner, and half a dozen banner/interstitials. It's 3-5% content spread thinly over an ad engine.
The Google web is about squeezing ads down your throat.
The only reason people make content they aren't passionate about is advertising.
Actually that's not always the case. We publish a lot of blog content and it's really hard to publish new content that replaces old articles. We still see articles from 2017 coming up as more popular than newer, better treatments of the same subject. If somebody knows the SEO magic to get around this I'm all ears.
1. Accept quid pro quo to send all queries to Google by default
If what these companies were telling their readers was true, i.e., that advertising is "essential" for the web to survive, then how are the text-heavy websites returned by this search engine (which are not discoverable through Google, the default search engine for Chrome, Firefox, etc.) able to remain online? Advertising is essential for the "tech" company middleman business to survive.
A "Top 10 albums of all time" post is actually better off going through 10 genres of popular music from the past 50 years and picking the top album (plus mentioning some other top albums in the genre) for each one.
That gives the user the overview they're probably looking for, whether those are the top 10 albums of all time or not. It's a case of what the user searched for vs what they actually really want.
The real trick would be some kind of engine that can aim just above where the user's at.
Appropriately enough, I couldn't find a good quote to verify that since Google is only giving me newspapers and magazines talking about Sir Tim in the context of current events. I do believe it's in his book "Weaving the Web" though.
You have the causality reversed. Google results reflect the fact that society is dumb.
I'm not sure the idea of a sentient being not having a bias is meaningful. Reality, once you get past the trivial bits, is subjective.
Maybe not the same kind of bias we think of in terms of politics and such, but I wonder if there's a connection.
I want an AI that is biased to the truth when there is an objective one, and my tastes otherwise. (that is when asked to find a good book it should give me fantasy even though romance is the most popular genre and so will have better reviews)
So, AIs are actually on par with most adults now? (Sorry)
I hear stories about Flash and ActiveX but I literally never needed these to shop or pay bills online. Payments also didn't require scripts from a dozen domains and four redirects.
> If you are looking for fact, this is almost certainly the wrong tool. If you are looking for serendipity, you're on the right track. When was the last time you just stumbled onto something interesting, by the way?
Compare, for example, "SDL tutorial" with "SDL tutorials". On Google you'd get the same results for both; this search engine, for better or worse, doesn't.
This is a design decision, for now anyway, mostly because I'm incredibly annoyed when algorithms are second-guessing me. On the other hand, it does mean you sometimes have to try different searches to get relevant results.
On the other hand, Google’s unawareness of (extensive and ubiquitous) Russian noun morphology is essentially what allowed Yandex to exist: both 2011 Yandex and 2021 Google are much more helpful for Russian than 2011 Google. I suspect (but have not checked) that the engine under discussion is utterly unusable for it. English (along with other Germanic and Romance languages to a lesser extent) is quite unusual in being meaningfully searchable without any understanding of morphology, globally speaking.
I... don’t think anything particularly surprising is happening here, except for quotes being apparently ignored? I’ve had it explained to me that a rare word is essentially indistinguishable from a popular misspelling by NLP techniques as they currently exist, except by feeding the machine a massive dictionary (and perhaps not even then). BRST is a thing that you essentially can’t even define satisfactorily without at the very least four years of university-level physics (going by the conventional broad approach—the most direct possible road can of course be shorter if not necessarily more illuminating). “Best” is a very popular word both generally and in searches, and the R key is next to E on a Latin keyboard. If you are a perfect probabilistic reasoner with only these facts for context (and especially if you ignore case), I can very well believe that your best possible course of action is to assume a typo.
How to permit overriding that decision (and indeed how to recognize you’ve actually made one worth worrying about without massive human input—e.g. Russian adjectives can have more than 20 distinct forms, can be made up on the spot by following productive word-formation processes, and you don’t want to learn all of the world’s languages!) is simply a very difficult problem for what is probably a marginal benefit in the grand scheme of things.
I just dislike hitting these margins so much.
(DDG docs do say it supports +... and even +"...", but I can’t seem to get them to do what I want.)
There are six of them (nominative [subject], genitive [belonging, part, absence, “of”], dative [indirect object, recipient, “to”], accusative [direct object], instrumental [device, means, “by”], prepositional [what the hell even is this]), so you have (cases) × (numbers) = 6 × 2 = 12 noun forms, and adjectives agree in number and gender with their noun, but (unlike Romance languages) plurals don’t have gender, so you have (cases) × (numbers and genders) = 6 × (3 + 1) = 24 adjective forms.
None of this would be particularly problematic, except these forms work like French or Spanish verbs: they are synthetic (case, number and gender are all a single fused ending, not orthogonal ones) and highly convoluted with a lot of irregularities. And nouns and adjectives are usually more important for a web search than verbs.
Hmm I seem to be getting only relevant results, no "best", not sure what you mean. Are you not doing verbatim search?
Re compounds, I expected they would be more or less easy to deal with by relatively dumb splitting, similar to greedy solutions to the “no spaces” problem of Chinese and Japanese, and your link seems to bear that out. But yeah, cheers to more language-specific stuff in your indexing. /s
Let's name it after a famous old scientist, and maybe add the year to prove it's modern: Galileo 2021.
You can't combine a few different ranked lists and expect to get results better than any of the original ranked lists.
I am skeptical of this application of the theorem. Here is my proposal:
Take the top 10 Google and Bing results. If the top result from Bing is in the top 10 from Google, display the Google results. If the top result from Bing is not in the top 10 from Google, place it at the 10th position. You'd have an algorithm that ties with Google, say, 98% of the time, beats it, say, 1.2% of the time, and loses 0.8% of the time.
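A minimal sketch of that merge rule, assuming the two ranked lists are already in hand (the class and method names are made up, and "place it at the 10th position" is read here as replacing the last slot):

    import java.util.ArrayList;
    import java.util.List;

    public class MergeSketch {
        // Keep Google's top 10; if Bing's #1 is missing from it,
        // splice that result into the 10th slot.
        static List<String> merge(List<String> googleTop10, List<String> bingTop10) {
            List<String> merged = new ArrayList<>(googleTop10);
            String bingFirst = bingTop10.get(0);
            if (!merged.contains(bingFirst)) {
                merged.set(merged.size() - 1, bingFirst); // "place it at the 10th position"
            }
            return merged;
        }
    }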
and the first conclusion is "something that you think will improve relevance probably won't"; the TREC conference went for about five years before making the first real discovery
It's true that Arrow's Theorem doesn't strictly apply, but thinking about it makes it clear that the aggregation problem is ill-defined and tricky. (E.g. note also that a ranking function for full-text search might have a range of 0-1 but is not a meaningful number like a probability estimate that a document is relevant; it just means that a result with a higher score is likely to be more relevant than one with a lower score.)
Another way to think about it is that for any given feature architecture (say "bag of words") there is an (unknown) ideal ranking function.
You might think that a real ranking function is the ideal ranking function plus an error and that averaging several ranking functions would keep the contribution of the ideal ranking function and the errors would average out, but actually the errors are correlated.
In the case of BM25, for instance, it turns out you have to carefully tune between the biases of "long documents get more hits because they have more words in them" and "short documents rank higher because the document vectors are spiky like the query vectors". Until BM25 there wasn't a function that could be tuned properly, and just averaging several bad functions doesn't solve the real problem.
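For reference, the usual textbook form of the BM25 scoring function; the tuning described above lives in the k_1 and b parameters (b controls how hard long documents are penalized, k_1 how quickly repeated terms saturate):

    score(D, Q) = \sum_{q_i \in Q} IDF(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\,(1 - b + b\,|D|/avgdl)}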
Conditions the mixed ranker doesn't have to satisfy:
"ranking while also meeting a specified set of criteria: unrestricted domain, non-dictatorship, Pareto efficiency, and independence of irrelevant alternatives"
Hypothetically you could treat these functions as meaningful but when you try you find that they aren't very meaningful.
For instance IBM Watson aggregated multiple search sources by converting all the relevance scores to "the probability that this result is relevant".
A conventional search engine will do horribly in that respect: you can fit a logit curve to make a probability estimator, and you might get p=0.7 at the most and very rarely get that; in fact, you rarely get p>0.5.
If you are combining search results from search engines that use similar approaches, you know those p's are not independent, so you can't take a large number of p=0.7's and turn them into a higher p.
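To make the independence point concrete: if you did have n genuinely independent relevance estimates p_1, ..., p_n for the same document, one standard way to combine them is noisy-OR style,

    p_{combined} = 1 - \prod_{i=1}^{n} (1 - p_i)

so three independent p=0.7's would give roughly 0.97 -- which is exactly the overconfidence you'd get by applying the formula to correlated engines.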
If you are using search engines that use radically different matching strategies (say they return only p=0.99 results with low recall) the Watson approach works, but you need a big team to develop a long tail of matching strategies.
If you had a good p-estimator for search you could do all sorts of things that normal search engines do poorly, such as "get an email when a p>0.5 document is added to the collection."
For now alerting features are either absent or useless and most people have no idea why.
Suppose there's an unambiguous ranked preference by all people among a set (webpages, ranking). Suppose one search engine ranks correctly the top 5 results and incorrectly the next 5 results, while another ranks incorrectly the top 5 and correctly the next 5.
What can happen is that there may be no universally preferred search engine (likely). In practice, as another commenter noted, you can also have most users prefer a certain combination of results (that's not difficult to imagine, for example by combining the top independent results from different engines).
There was such an app in the early 2000's, before Google went mainstream, and Altavista-like engines were not good: Copernic 2000.
I guess I'm officially old now.
This is a pattern I see over and over again: some research group or academics show that something can be done (summaries that make sense and are true summaries, evolutionary-algorithm FPGA programming, real-time gaze prediction, etc.), there's a few published code repos and a bit of news, then 'poof' - nowhere to be seen for 15 years or more.
Still, the best way to break SEO is to have actual competition in the search space. As long as SEO remains focused on Google there is an opportunity for these companies to thrive by evading SEO braindamage.
You see some of that "two readers" divide even in classic cookbooks, where "celebrity" chefs of the day might spend much of a cookbook on a long rambling memoir. Admittedly such books were generally well indexed and had tables of contents to jump right to particular recipes, but the concept of "long personal ramble of what these recipes mean to me" is an old one in cookbooks too.
One audience matches your description and is the invested reader. They want that blogger's storytelling. They might make the recipe, but they're a dedicated reader.
The other audience is not the recipe-searcher, but instead Google. Food bloggers know that recipe-searchers are there to drop in, get an ingredient list, and move on. They won't even remember the blog's name. So the site isn't optimized for them. It's optimized for Google.
"Slow the parasitic recipe-searcher down. They're leeches, here for a freebie. Well they'll pay me in Google Rank time blocks."
This is not entirely true, though. If a randomly found recipe turns out particularly good, I'll bookmark the site and try out other dishes. It's a very practical method to find particularly good* recipe collections.
*) In this case "good" means what you need - not just subjectively "tasty", but e.g. low cost, quick to prepare, low calorie or in line with a particular diet and so on.
I think this assumes facts not in evidence. It certainly seems like an overwhelming number of "blogs" are not actual blogs but SEO content farms. There are no regular readers of such things because there are no actual authors, just someone who took a job on Fiverr to spew out some SEO garbage. Old content gets reposted almost verbatim because new content ranks better according to Google.
The only reason these "blogs" exist is to show ads and hopefully get someone's e-mail (and implied consent) for a marke....newsletter.
It's at least half the business model of Food Network shows: aspirational kitchens and the people that live in them, and also sometimes here's their recipes. (The other half being competitions, obviously.) I've got friends that could deliver entire doctoral theses on the Bon Appetit Test Kitchen (and its many YouTube shows and blogs) and the huge soap-operatic drama of 2020's events, where the entire brand milkshake-ducked itself: falling into people's hearts as "feel good" entertainment early in 2020/the pandemic and then exploding very dramatically with revelations and betrayals that fall.
Which isn't to say that there aren't garbage SEO farms out there in the food blogging space as well, but a lot of the big ones people commonly complain about seeing in google's results do have regular fans/audiences. (ETA: And many of the smaller blogs want to have regular fans/audiences. It's an active influencer/"content creator" space with relatively low barrier to entry that people love. Everyone's family loves food, it's a part of the human condition.)
A couple of browser add-ons specifically geared around trimming recipe pages down have been taken down due to similar complaints.
Like an article about some current event will undoubtedly begin with "when I was traveling ten years ago...".
> “Mere listings of ingredients as in recipes, formulas, compounds, or prescriptions are not subject to copyright protection. However, when a recipe or formula is accompanied by substantial literary expression in the form of an explanation or directions, or when there is a combination of recipes, as in a cookbook, there may be a basis for copyright protection.”
That's ads. When mobile users have to scroll past 10 ads, they'll click on some of them and make the blog money.
I continue to be curious about this kind of complaint. If all you want is a recipe list, without any of the fluff, why would you click on a link to a blog, rather than on a link to a recipe aggregator?
Foodie blogs exist specifically for the people who want a foodie discussion and not just an ingredient list.
Is it because blogs tend to have better recipes overall? In that case, isn't there a bit of entitlement involved in asking that the author self-sacrificingly provides only the information that you want, without taking care of their own needs and wants, also?
To be honest, I don't follow recipes when I cook unless it's a dish I've never had before. At that point what I want is to understand the point of the dish. A list of ingredients and preparation instructions don't tell me what it's supposed to taste and smell like. The foodie blogs at least try to create a certain... feeling of place, I suppose, some kind of impression that guides you when you cook. I wouldn't say it always works but I appreciate the effort.
My real complaint with recipe writers is that they know how to cook one or two dishes well and they crib the rest off each other so even with all the information they provide, you still can't reliably cook a good meal from a recipe unless you've had the dish before. But that's my personal opinion.
If you want JUST recipes, pay money instead of just randomly googling around. America's Test Kitchen has a billion vetted, really good recipes. That solves that problem.
I also think a search engine like this would be quite hard to game. An ML-based classifier trained on thousands of text-heavy and media-heavy screenshots should be quite robust and, I think, very hard to evade. The "game" would then become identifying the crawler so you can serve it a high-ranking page while serving crap to the real users, and that seems fairly easy to defeat if the search engine does a second pass using residential proxies and standard browser user agents to detect the behavior (it could also threaten huge penalties, like banning the entire domain for a month, to deter attempts at this).
If you almost only plant wheat, you are going to end up with one hell of a pest problem.
If you almost only have Windows XP, you are going to have one hell of a virus problem.
If you almost only have SearchRank-style search engines (or just the one), you are going to have one hell of a content spam problem.
Even though they have some pretty dodgy incentives, I don't think google suffers quality problems because they are evil, I think ultimately they suffer because they're so dominant. Whatever they do, the spammers adapt almost instantly.
A diverse ecosystem on the other hand limits the viability of specialization by its very nature. If one actor is attacked, it shrinks and that reduces the opportunity for attacking it.
So just add human review to the mix, if a site is obviously trying to game the system (listicles, seo spam etc) just drop and ban them from the search index.
but aren't you curious about the 7th reason? it will surprise you!
If you move data organization to another type of organization (non-profit, state, universities - private or public), then the question of data prioritization becomes highly political. What should be exposed? What should not? What to put first? ...
It is already, but to a smaller extent, since money-making companies have little interest in the meaning of the data, and high interest in the commercial value of their users.
Google's revenue last year was $146 billion, and it operates nowhere near the theoretical maximum. Most of that revenue is advertising.
Here's what I've tried with a few variations:
golang generics proposal,
machine learning transformer,
covid hospitalization germany
Another need I guess might be reviews, for which RT or MC are better than IMDB: not sure if either of those two will fare better than IMDB in this search engine but again Wiki has links out (in addition to good reception summaries)
I never even posted on it myself, but browsing the discussions one could learn all sorts of trivia, inside info, speculation, etc about each movie.
Since they (inexplicably) killed that feature, I rarely even visit anymore. You're right, for many purposes Wikipedia is better, especially for TV series episode lists with summaries.
I think they removed it in part because new movies, like Star Wars and superhero movies, had a lot of negative activity.
This seems to return a pretty decent number of sites relating to that (as well as some sites not relating to that).
The search engine isn't always great at knowing what a page is about, unfortunately.
This seemed to return mostly relevant results
This used to be how all search engines worked, but I guess people have been taught by google that they should ask questions now, instead of search for terms.
I wonder how I can guide people to make more suitable queries. Maybe I should just make it look less like google.
If your goal was to create a search engine that ignored listicles and other fluff and instead got you meatier results like "academic talks" and such, then no.
What kind of links were you expecting to find?
First result after Wikipedia:
"Radiophone Transmitter on the U.S.S. George Washington (1920)
In 1906, Reginald Fessenden contracted with General Electric to build the first alternator transmitter. G.E. continued to perfect alternator transmitter design, and at the time of this report, the Navy was operating one of G.E.'s 200 kilowatt alternators
Another result in the first few:
" - VANDERBILT, GEORGE WASHINGTON
PH: (800) ###-#233 FX: (#03) 641-5###.
And just below that terrible result:
"I Looked and I Listened -- George Washington Hill extract (1954)
Although the events described in this account are undated, they appear to have occurred in late 1928. I Looked and I Listened, Ben Gross, 1954, pages 104-105: Programs such as these called for the expenditure of larger sums than NBC had anticipated. It be http://earlyradiohistory.us/1954ayl2.htm
Dramatically worse than Google.
Ok, how about a search for "Rome" then? Surely it'll pull some great text results for the city or the ancient empire.
"Home | Rome Daily Sentinel
Reliable Community News for Oneida, Madison and Lewis County
The fourth result for searching "Rome":
"Glenn's Pens - Stores of Note
Glenn's Pens, web site about pens, inks, stores, companies - the pleasure of owning and using a pen of choice. Direcdtory of pen stores in Europe.
Again, dramatically worse than Google.
Ok, how about if I search for "British"?
"BRITISH MINING DATABASE
And after that:
"British Virgin Islands
Many of these photos were taken on board the Spirit of Massachusetts. The sailing trip was organized by Toto Tours. Images Copyright © Lowell Greenberg Home Up Spring Quail Gardens Forest Home Lake Hodges Cape Falcon Cape Lookout, Oregon Wahkeena
Again, far off the mark and dramatically worse than Google.
I like the idea of Google having lots of search competition, this isn't there yet (and I wouldn't expect it to be). I don't think overhyping its results does it any favors.
If you are going to claim something is wide of the mark then you really ought to tell us at least roughly where the mark is.
As for the results you linked, it's part of the zeitgeist to list other entities sharing the same name. Sure, they could use some subtle changes in ranking, but overall the returned links satisfy my curiosity.
(Source: I looked up several Irish politicians because I run an all-text website containing every single word that they say in parliament. I got nothing of use, or even of interest, for anything.)
Not angry in the least. I'm thrilled someone is working on a search competitor to Google.
I understand you're attempting to dismiss my pointing out the bad results by calling me angry though. You're focusing your content on me personally, instead of what I pointed out.
The parent was far overhyping the results in a way that was very misleading (look, it's better than Google!). I tried various searches, they were not great results. The parent was very clearly implying something a lot better than that by what they said. The product isn't close to being at that level at this point, overhyping it to such an absurd degree isn't reasonable or fair to the person that is working on it.
I would specifically suggest people not compare it to Google. Let it be its own thing, at least for a good while. Google (Alphabet) is a trillion dollar company. Don't press the expectations so far and stage it to compete with Google at this point. I wouldn't even reference Google in relation to this search engine, let it be its own thing and find its own mindshare.
Except the author goes to quite some lengths to explain that his search engine is not a competitor to Google, and is in fact exactly the opposite of Google in many ways: https://memex.marginalia.nu/projects/edge/about.gmi
Like if you search for "How do I make a steak", you aren't going to get very good results. But a better query is "Steak Recipe", as that is at least a conceivable H1-tag.
But just a week ago I found out that these "how", "what" questions give better and faster results on Google.
The main pain-point, though, is that a lot of long-tail searches you could've used to find different results in years past, now seem to funnel you to the same set of results based on your apparent intent. At least, it has felt that way -- I'm not entirely sure how the modern google algorithm works.
I appreciate that it is easier for newcomers, but after years I still hate it personally, especially that they cannot even avoid meddling with my queries when I try to accept the new system and use the verbatim option.
A search engine that accepted regex as the search parameter would be amazing.
I actually used this method as a field filter for a bunch of simple internal tools to search for info. Originally people were asking for individual search capabilities, but I didn't want it to become a giant project with me as the implementor of everyone's unique search capability feature request - so I just gave them regex, encoded inputs into the URL query string so they can save searches - gave em a bunch of examples to get going and now people are slowly learning regex and coming up with their own "new features" :P
But this made sense because it's a relatively small amount of data, so small that it's searched in the front end, which is why it's more of a filter... I don't think pure regex would scale when used as a query on a massive DB; it would need some kind of hierarchy to only bother parsing a subset of relevant text... unless there is some clever functional regex caching algorithm that can be used.
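A rough sketch of the core of such a filter (Java here purely for illustration; the actual tool lived in the front end, and the pattern would come from the saved query string):

    import java.util.List;
    import java.util.regex.Pattern;
    import java.util.stream.Collectors;

    public class RegexRowFilter {
        // Keep only the rows matching the user-supplied regular expression.
        static List<String> filter(List<String> rows, String userPattern) {
            Pattern p = Pattern.compile(userPattern, Pattern.CASE_INSENSITIVE);
            return rows.stream()
                       .filter(row -> p.matcher(row).find())
                       .collect(Collectors.toList());
        }
    }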
They used the naive approach: you searched for "steak", and they would bring the pages which included the word "steak".
The problem is that people could fool these engines by adding a long sequence like "steak, steak, steak, steak, steak, steak" to their site -- to pretend that they were the most authoritative page about steaks.
Google's big innovation was to count the referrers -- how many pages used the word "steak" to link to that particular page.
The rest is history.
I understand they are trying to maximize ad revenue and search does work very well for people who are looking for products or services.
But it no longer works well for finding information that is even slightly obscure.
I don't see a lot of people investing in SEO to boost their Marginalia results.
Then people fooled Google into showing the White House as top result when searching for "a miserable failure".
At the moment marginalia's approach of sorting pages into quality buckets based on lack of JS seems to be working extremely well, but of course it will be gamed if it gets popular.
However, I'd rather want SEO-crafting to consider itself with minimizing JS, rather than spamming links into every comment field on every blog across the globe ;-)
If you're looking for feedback, both from a UI design and utility standpoint, you might consider "inlining" results from selected sites, e.g. Wikipedia, Stack Exchange, etc. Having worked on search for a long time, inlining (onebox etc.) is a big reason users choose Google, and that challengers fail to get traction. If you're Serious(tm), dig into the publishers' structured data formats and format those, create a test suite, etc.
A word of caution: if this takes off, as a business it's vulnerable to Google shifting its algorithms slightly to identify the segment of users+queries who prefer these results and give the same results to those queries.
Hope this helps!
I've been working on an engine for personal websites, currently trying to build a classifier to extract them from commoncrawl, if you have any general tips on that kind of project they'd be very welcome.
Classification is really hard. I'm struggling with it myself, as a lot of stuff like privacy policies and change logs turns out to share the shape of a page of text.
I'm thinking of experimenting with ML classifiers, as I do have reasonably good ways of extracting custom datasets. Finding change logs and privacy policies is easy, excluding them is hard.
Is there a way to suggest or add sites? I went looking for woodgears.ca and only got one result. I also think my personal blog would be a good candidate for being indexed here but I couldn't find any results for it.
Unfortunately this doesn't seem to be a feature which new search engines are focusing on - Brave Search also misses that feature...
A design sketch of the index: it uses one file with sorted URL IDs, and one with IDs of N-grams (i.e. words and word-pairs) referring to ranges in the URL file; as well as a dictionary relating words to word IDs, which is a GNU Trove hash map I modified to use memory-mapped data instead of directly allocated arrays.
So when you search for two words, it translates them into IDs using the special hash map, goes to the words file and finds the least common of the words; starts with that.
Then it goes to the words file and looks up the URL range of the first word.
Then it goes to the words file and looks up the URL range of the second word.
Then it goes through the less common word's range and does a binary search for each of those in the range of the more common word.
Then it grabs the first N results, and translates them into URLs (through mariadb); and that's your search result.
I'm skipping over a few steps, but that's the very crudest of outlines.
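A toy version of that two-word lookup, assuming the URLs file is just a long[] of per-word sorted ranges (the word-ID dictionary, memory mapping, and the mariadb translation back to URLs are all left out; the names are made up):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class TwoWordLookup {
        // urlIds plays the role of the URLs file: concatenated per-word ranges,
        // each range sorted by URL ID. A word's posting list is urlIds[start..end).
        static List<Long> intersect(long[] urlIds,
                                    int rareStart, int rareEnd,       // range of the rarer word
                                    int commonStart, int commonEnd) { // range of the commoner word
            List<Long> hits = new ArrayList<>();
            for (int i = rareStart; i < rareEnd; i++) {
                long url = urlIds[i];
                // membership test by binary search within the larger range
                if (Arrays.binarySearch(urlIds, commonStart, commonEnd, url) >= 0) {
                    hits.add(url);
                }
            }
            return hits;
        }
    }

Iterating the rarer word's range and binary-searching the commoner one keeps the work proportional to the smaller posting list, which is the same trade-off described further down in the thread.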
Definitely want to see more people doing that kind of low-level work instead of falling back to either 'use elasticsearch' or 'you can't, you're not google'.
For the moment I have just south of 20 million URLs indexed.
1 x 20 million bytes = 20 MB.
10 x 20 million bytes = 200 MB.
100 x 20 million bytes = 2 GB.
1,000 x 20 million bytes = 20 GB.
10,000 x 20 million bytes = 200 GB.
100,000 x 20 million bytes = 2 TB.
1,000,000 x 20 million bytes = 20 TB.
This is still within what consumer hardware can deal with. It's getting expensive, but you don't need a datacenter to store 20 TB worth of data.
How many bytes do you need, per document, for an index? Do you need 1 MB of data to store index information about a page that, in terms of text alone, is perhaps 10 KB?
How do you rank the results (is it based on content only) or you have external factors too?
What is your personal preferred search option of the 7 and why?
Thanks for making something unique and sorry that despite all the hype this got, you got only $39/month on Patreon. It is telling in a way.
Custom crawler, and I seem to get around 100 documents per second at best, maybe closer to 50 on average. Depends a bit on how many crawl-worthy websites it finds, and there is definitely diminishing returns as it goes deeper.
>How do you rank the results (is it based on content only) or you have external factors too?
I rank based on a pretty large number of factors, incoming links weighted by the "textiness" of the source domain, and similarity to the query.
> What is your personal preferred search option of the 7 and why?
I honestly use Google for a lot. My search engine isn't meant as a replacement, but a complement.
> Thanks for making something unique and sorry that despite all the hype this got, you got only $39/month on Patreon. It is telling in a way.
Are you kidding? I think the Patreon is a resounding success! I'm still a bit stunned. I've gotten more support and praise, not just in terms of money but also emails and comments here than I could have ever dreamed possible.
And this is just the start, too. I only recently got the search engine working this well. I have no doubt it can get much better. The fact that I have 11 people with me on that journey, even if they "just" pay my power bill, that's amazing.
I'm honestly a bit at a loss for words.
And I am not kidding. I think for something that got so much attention on HN, where realistically this kind of product can only exist for now, the 'conversion' rate was very low. Billion-dollar companies were made out of HN threads with a lot less engagement. Makes me wonder: do we really want a search engine like this, or do we just like the idea of it?
And what are the barriers to use something like this? You say yourself that you are using Google most of the time. Is jumping to check results on this engine going to be too much friction for most uses?
Can something like this exist in isolation? What kind of value would it need to provide for users to remember using it en-masse as an additional/primary vertical search like they do for Amazon?
Just thinking out-loud as I am also interested in the space (through http://teclis.com).
Ultimately I think running something like this for profit would create really unhealthy incentives to make my search engine worse. Any value it brings, right now, it brings because it isn't trying to cater to every taste and every use case.
I also hate the constant "don't forget to slap the like and subscribe buttons" shout-outs of modern social media, even though I'm aware it is extremely effective. If I went down that route, I would become part of the problem I'm trying to help cure. I do feel the sirens' call though; it's intoxicating getting this sort of praise and attention.
I want this to be a long-term project, not some overnight Cinderella story.
In the end, my search engine is never going to replace google. It isn't trying to, it's trying to complement it. It's barely able now, but hopefully I can make it much better in the months and years to come.
This allows quite a bit of its own kind of freedom even if maximum financial opportunity is not fully exploited. Perhaps even because you are not grasping for every dollar on the table at all times.
You can do things without having to know if they will pay off, and if it turns out big anyway you can make money as a byproduct of what you do rather than having pure financial pursuit be the root of every goal.
So given your setup is already ideal for 'conversions' for this population (low profile, high integrity, no BS) I was simply genuinely surprised that only 11 people converted given enormous visibility/interest this thread had. Hope that makes sense.
I'd absolutely consider sending someone money if they kept bringing something of value into my life. If I want more people to join the patreon, I'll just have to earn their trust and support.
Plus another excellent feature was you would get the same search results no matter who or where you were for quite some period of calendar time.
If something new did appear it was likely to be one of the new sites that was popping up all the time and it was likely to be as worthwhile as its established associates on the front page.
You shouldn't need to crawl nearly as fast if you can compensate by treading more suitably where those have gone before.
How does the range binary search work, does it just prune out the overlaps, how efficient is it and how much data do you have in there for say "hello" and "world" f.ex?
The URLs in a range are sorted. A sorted list (or list-range) forms an implicit set-like data structure, where you can do binary searches to test for existence.
Consider a words file with two words, "hello" and "world", corresponding to the ranges (0,3), (3,6). The URLs file contains URLs 1, 5, 7, 2, 5, 8.
The first range corresponds to the URLs 1, 5, 7; and the second 2, 5, 8.
If you search for hello world, it will first pick a range, the range for "hello", let's say (1,5,7); and then do binary searches in the second range -- the range corresponding to "world" -- (2,5,8) to find the overlap.
This seems like it would be very slow, but since you can trivially find the size of the ranges, it's possible to always do them in an order of increasing range-sizes. 10 x log(100000) is a lot smaller than 100000 x log(10)
Funny I also selected "hello" and "world" above! Xo
My system is also written in Java btw!
Here are example results of my word search:
But yeah, in short pseudocode:
for url in range-for-"hello":
    if binary-search(range-for-"world", url):
        add url to results
I would first search for the bigram hello_world, which is an O(1) array lookup; and then for documents merely containing the words hello and world (usually not as good a search result), which is the algorithm I'm describing in the parent comment.
What can we do to foster a sustainable bazaar of projects to make it easier to build web search engines?
Now, that is assuming you aren't on some VPS provider. If you're going to crawl, you'll have the best chance when you use your own IPs on your own ASN, with DNS and reverse DNS set up correctly. This makes it so the IP reputation systems can detect you as a crawler but not one that hammers every site it visits.
Also, I imagine that, for a search engine like this, it doesn't expect content to change much anyways - so it can take its time crawling every site only once every month or two, instead of the multiple times a week (or day) search engines like Google have to for the constantly-updated content being churned out.
You may already be aware of this, but the page doesn't seem to be formatted correctly on mobile. The content shows in a single thin column in the middle.
I'd prefer if it does just one thing and does that really well. Don't waste your time on calculator and conversion functions, or pseudo-natural language queries. There are plenty of good calculator & converter tools and websites, but we all need a really good search engine. I think you'd be better looking at handling synonyms and plurals.
Don't worry, this stuff is easy, and doesn't even remotely take away from the work on the harder problems.
I originally targeted a Raspberry Pi 4 cluster. It was only able to deal with about 200k pages at that stage, but it did shape the design in a way that makes very thrifty use of the available hardware.
My day job is also developing this sort of high-performance Java application, I guess that helps.
> What is your stack, elastic search or something simpler?
It's a custom index engine I built for this. I do use mariadb for some ancillary data and to support the crawler, but it's only doing trivial queries.
> How did you crawl so many websites for a project this size?
It's not that hard. Like it seems like it would be, and there certainly is an insane number of edge cases, but if you just keep tinkering you can easily crawl dozens of pages per second even on modest hardware (of course distributed across different domains).
> Did you use any APIs like duck duck go or data from other search engines?
Nope, it's all me.
> Are you still incorporating something like PageRank to ensure good results are prioritized or is it just the text-based-ness factor?
I'm using a somewhat convoluted algorithm that takes into consideration the text-based-ness of the page, as well as how many incoming links the domain has, weighted by the text-based-ness of the origin domains.
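Purely as an illustration of what such a weighting could look like (this is not the actual formula, which isn't spelled out here), something along the lines of

    rank(d) = \alpha \cdot textiness(d) + \beta \sum_{o \in inlinks(d)} textiness(o)

where textiness(·) is some score of how text-heavy a page or domain is, and alpha and beta are tuning weights.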
It would be interesting to try a page rank-style approach, but my thinking is that because it's the algorithm, it's also the algorithm everyone is trying to game.
Is there any way that you can get an HTTP certificate?
I use an old iPhone 4S, and most of the modern web is inaccessible due to TLS. Hacker News and mbasic.facebook are two of the last sites I can use.
Usually text-based sites are more accessible, so this could be really useful to help me continue using my antique devices!
Interesting idea. Definitely see an overlap with eReader markets and looking at text only contents.
How does it work?
I use external libraries for parsing HTML (JSoup) and robots.txt; but that's about it.
But I've since expanded my websites, so now I think these play a decent role in later iterations, although virtually all of them are pages I've found eating my own dogfood:
It's hosted on my consumer-equipment server (Ryzen 3900X, 128 GB RAM, Optane 900p + a few IronWolf drives), bare bones on Debian.
I have a criticism that I think may pertain to the ranking methodology. I searched for "discovery of Australia". Among the top results were:
* A site claiming that the biblical flood was caused by Earth colliding with a comet (with several other pages from that site also making the top search results with other wild claims, e.g. that the Egyptians discovered Arizona);
* Another site claiming the first inhabitants of Australia were a lost tribe of Israel;
* A third site claiming that Australia was discovered and founded by members of a secret society of Rosicrucians who had infiltrated the Dutch East India Company and planned to build an Australian utopia...
Arguably, this seems to rank the way Google's engine used to, since it couldn't run JS and they wanted to punish sites that used code to change markup at render time. At least, when I used to have to do onsite SEO work, it was always about simple tag hierarchies.
I wonder whether there isn't some better metric of validity and information quality than what markup is used. Some of the sites that surfaced further down could be considered interesting and valuable resources. I think not punishing simple wall-of-text content is a good thing. But to punish more complicated layouts may have the perverse effect of downranking higher-quality sources of information - i.e. people and organizations who can afford to build a decent website, or who care to migrate to a modern blogging platform.