Yandex ‘leak’ reveals search ranking factors (searchengineland.com)
278 points by Password2 on Jan 30, 2023 | 90 comments

Interesting ones I have found:

108 MusicQ: The musicality of the request. The results of the sorcerer Anton Konygin.

276 NightQuery: The request is set mainly at night (and accompanying ones for morning, afternoon etc)

316 IsUa: Domain in the .ua zone (ukraine)

747 WordHostWikiSum: The relative popularity of the Word -Host pair, where Word is the word from the Title article on Wikipedia, and the Host is the host that is referred to in this article.

755 NastyContent: Content ugliness factor. (several of the other factors related to porn and "adult" content are defined as multiples of this)

892 ShortVideo: A document is a short video (Tiktok, Reels, Shorts)

894 TelegramPost: Document - post in telegram (note I could not find similar for twitter etc, but they do have for tiktok)

1076 More90SecVisitsShare: The share of visits for which the time spent during the day on the host is more than 90 seconds (they track how long users are spending on pages)

1078 RankHackedNovaPhp: Rank of hacked sites

1309 DistanceToAnkara: The distance from the city from where the request was set to Ankara

1603 YellowImgCount: Average yellow images count on host

1614 NewsAgencyRating: Rating of news agency from agencies.json (Yandex.News resource)

Everything in there makes sense imho except the Yellow Image Count. Does anyone know why that would be a factor?

It's been a long time since I've read it, but I vaguely remember a write-up many years ago (before we used ML for NSFW detection) about how either yellow images would often produce false positives in popular detection algos, and/or about how the yellow-ness of an image could be used as a lower-tech proxy for basic NSFW detection (seemingly because of its proximity to some common skin tones).

I don't know if either of these are relevant here, but it's possible a weight on Yellow Images could be related to NSFW images.

Upon closer inspection, they have a lot of weights regarding "yellowness" [0], and several of them say "based on Toloka", which is an antifraud service that assigns colors to violation levels [1].

So my suspicion above is probably wrong, and "yellow" images probably refer to a Toloka score/color for whatever domain the images are hosted on.

[0] https://yandex-explorer.herokuapp.com/search?q=yellow&o=all

[1] https://join.toloka.ai/blog/project-limits/

Wrong Toloka. Yandex has its own Toloka which is a crowdsourcing platform for repetitive tasks that inherently require a human in the loop. In other words, this factor comes from real human assessors hired via Toloka.

No, but can you explain the distance to Ankara one?

well now you're making me doubt myself, that's a weird one too. My first guess was "they need to know how far one request is from another, and Ankara is as good as any city"

So suppose you're in New Orleans and I am in San Jose, now we can compare our distance, because geographical distance usually equates to cultural differences and perceptions of quality.

They have a Turkish version, so I guess this props up Turkish results to requests from Turkey (granted, Ankara is not exactly in the middle of the country ;-)

Ankara is the nearest city to Kırşehir, considered by some to be the geographical centre of the world (as viewed on a typical flat map).

there was exactly one other similar one (Distance to Magadan) which is a random Russian town. So I think it might be deciding which of their datacenters should serve the request?

It is a bit funny to see people trying to guess the meaning of something which is common knowledge elsewhere. Puts a perspective on all the talk about “modern, global, connected, automatically translated world”.

For a decade and a half, Yandex has had a codename for each big revision of search ranking algorithms and release of new search features. Most were city names, and Magadan was one of them. So “distance” is probably a distance to some rating introduced in that release. Whether it is a positive or negative factor (cutting obvious over-optimized sites), I have no idea.

Ankara is the capital of Turkey. For RU, UA & BY locales they have separate ranking models for queries from their capitals, but not for TR. Maybe, guessing, they did not have enough data to build two TR models and this additional input feature was good enough

I’m pretty sure yellow is slang for a category of images. It could be ad images.

Could be! Perhaps as in Yellow Pages?


In other words, clickbait images from clickbait ad networks.

> NewsAgencyRating: Rating of news agency from agencies.json

Was the agencies.json published?

sadly I couldn't find it. A lot of the entries make reference to their corporate wiki and intranet though, so it's possible it hasn't been released.

I agree though, would be very interesting indeed to see who they promote and/or push down

> 108 MusicQ: The musicality of the request. The results of the sorcerer Anton Konygin.

Sounds fascinating. Anyone have any idea what this is about? Is [0] the Konygin in question?

[0] https://orcid.org/0000-0002-0037-2352

Apparently this leak is a huge boon to SEO. Expect more SERPs to become SEO-overtaken garbage dumps in the near future. Truly depressing.

I'm doubtful this will have much of an effect for three reasons:

- These are factors, not the underlying algorithms which use them, which means it's not possible to deduce your own ranking from that data.

- Many of those are so vague as to be useless.

- The most important factors for SEO are those we already know about.

It's possible to use an algorithmic approach (i.e. linear regression) to derive the weights for each of these factors.

This is assuming the weights are combined via some linear function. That's not necessarily the case. It's likely that the ranking algorithm is more complex than: calculate features, multiply by weights, add to get ranking. Sure you'll probably still be able to learn something more by playing with the different features, but I doubt you'll get any real meaningful "weights." Especially when how many features are calculated is a black box.
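To make the disagreement concrete, here is a toy sketch on synthetic data. Everything in it is an assumption for illustration: two made-up features, made-up weights, and a made-up nonlinear ranker; none of it comes from the leak. When the hidden ranker really is a dot product, least squares recovers the weights exactly; when it is not, the fitted "weights" still come out as numbers but leave a clear residual.

```python
import random

def fit_two_weights(rows, scores):
    """Least-squares fit of score ~ w1*x1 + w2*x2 (no intercept), via the 2x2 normal equations."""
    s11 = sum(x1 * x1 for x1, _ in rows)
    s12 = sum(x1 * x2 for x1, x2 in rows)
    s22 = sum(x2 * x2 for _, x2 in rows)
    b1 = sum(x1 * y for (x1, _), y in zip(rows, scores))
    b2 = sum(x2 * y for (_, x2), y in zip(rows, scores))
    det = s11 * s22 - s12 * s12
    return ((b1 * s22 - b2 * s12) / det, (b2 * s11 - b1 * s12) / det)

def mean_squared_error(rows, scores, weights):
    w1, w2 = weights
    return sum((w1 * x1 + w2 * x2 - y) ** 2 for (x1, x2), y in zip(rows, scores)) / len(rows)

random.seed(0)
rows = [(random.random(), random.random()) for _ in range(200)]

# Case 1: the ranker really is a dot product (hypothetical weights 0.7 and 0.3).
linear_scores = [0.7 * x1 + 0.3 * x2 for x1, x2 in rows]
# Case 2: a made-up nonlinear ranker with an interaction and a reciprocal term.
nonlinear_scores = [0.7 * x1 * x2 + 0.3 / (1.0 + x2) for x1, x2 in rows]

w_lin = fit_two_weights(rows, linear_scores)
w_non = fit_two_weights(rows, nonlinear_scores)
mse_lin = mean_squared_error(rows, linear_scores, w_lin)
mse_non = mean_squared_error(rows, nonlinear_scores, w_non)
print("linear ranker:   ", w_lin, mse_lin)
print("nonlinear ranker:", w_non, mse_non)
```

Note that this sketch also assumes the feature values and the scores are observable, which is not the case for someone outside the search engine.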

Why would it be more complicated? A simple dot product would have excellent scaling characteristics

I designed the ranking algorithm for our product search. There are several factors in it that are nonlinear.

For example, we obviously want the most-viewed and most-bought products to rise to the top. However, we also expect there's an initial honeymoon period for many new products, where people want to see them but they don't have enough history yet to sway the popularity factor. So there's a non-linear term that looks kind of like

  ( weight_factor / (product_age_days * age_scale_constant) )
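A minimal runnable sketch of that term, for anyone who wants to play with it. The default constants, the clamp to one day (to avoid dividing by zero for products listed today), and the way the term is added to a popularity score are all assumptions for illustration, not the actual production formula.

```python
def honeymoon_boost(product_age_days, weight_factor=1.0, age_scale_constant=0.1):
    """Reciprocal-of-age boost: large for new products, fading as they age."""
    # Assumption: clamp the age to at least one day so a brand-new
    # product does not cause a division by zero.
    age = max(product_age_days, 1)
    return weight_factor / (age * age_scale_constant)

def score(popularity, product_age_days):
    """Hypothetical combined score: popularity plus the honeymoon term."""
    return popularity + honeymoon_boost(product_age_days)

# A one-day-old product with modest popularity vs. a 30-day-old, more popular one.
new_product = score(popularity=5.0, product_age_days=1)
old_product = score(popularity=10.0, product_age_days=30)
print(new_product, old_product)
```

With these made-up constants the new product outranks the older one for now, and the boost shrinks toward zero as the age grows.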

At a glance it seems like there would be a way to form the weighting adjustment into a subquery and multiply by the reciprocal. And then it’s still multiply-accumulate, but maybe I’m missing something.

Isn't that just saying "delegate the nonlinear part out to a black box"?

At a glance (1 / (product_age_days * age_scale_constant)) just seems like another factor you can multiply (with the calculation of reciprocal costing more in compute time). Again, I'm more than likely misunderstanding you or lacking the context.

When it comes to ML scalability is a constraint not a goal. The goal is to minimize some loss function and it turns out simple dot product can be outperformed by more complex algorithms.

I remember reading a few years ago that most search engines use some tree based model. If that's the case, that means the idea of monotonic linear weights is not relevant.

Can you be more specific? Dot product is about as performant as it gets with linear memory access and SIMD multiply accumulate. Throw random memory access and flow control in there and it’s a struggle to do it faster. Unless the factors are sparse, in which case just elide the zero values.

> scalability is a constraint not a goal. The goal is to minimize some loss function

My bad. I was under the impression that most search engines are compute bound, but if anything there’s probably a glut of compute for such applications and a market appetite for better results.

Also, I'd assume it's highly time-agnostic (i.e. content change timespan : compute availability timespan).

So you can run your bulk-recomputing whenever you have spare capacity.

Stale rankings aren't great, but don't hurt that much. As long as your liveness is more frequently updated, so you don't send people to dead sites.

Certainly caching is important, especially for Word2Vec or other NLP which you'd want to happen in a separate stage after crawl, but as someone mentioned in a sibling comment, there are some factors that are calculated per-query, which can have a lot of cache misses for novel queries.

If so, I'd highly suspect Google varies the compute/cache permitted for novel queries.

By this point, I can't imagine they haven't automatically balanced {revenueFromQuery} to {costOfQuery}.

No sense delivering hyperoptimized results if you lose money on generating them.

I’d suspect you’re right

Given some samples of search results we still don't know:

- The X matrix (e.g. the page rank score) for each result.

- The y vector, i.e. the score for each result. Although we can observe the relative ranking in each sample (would be interested to hear about techniques to cope with this).
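One common way to cope with having only relative order is the pairwise approach from learning-to-rank: turn each observed ranking into "A was above B" pairs and fit a model on the feature differences. The sketch below is a toy version under loud assumptions: the feature vectors are known (which is exactly the X problem in the first bullet), the hidden ranker is linear, and the data is synthetic. Only the direction of the learned weights is meaningful, not their scale.

```python
import math
import random

def pairwise_differences(ranked_features):
    """From one ranking (best first), build x_i - x_j for every pair with i above j."""
    diffs = []
    for i in range(len(ranked_features)):
        for j in range(i + 1, len(ranked_features)):
            higher, lower = ranked_features[i], ranked_features[j]
            diffs.append([a - b for a, b in zip(higher, lower)])
    return diffs

def fit_pairwise_logistic(diffs, epochs=300, lr=0.5):
    """Gradient ascent on the mean of log(sigmoid(w . diff)) over all pairs."""
    w = [0.0] * len(diffs[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for d in diffs:
            margin = sum(wk * dk for wk, dk in zip(w, d))
            # d/dw log(sigmoid(margin)) = (1 - sigmoid(margin)) * d
            coeff = 1.0 - 1.0 / (1.0 + math.exp(-margin))
            for k, dk in enumerate(d):
                grad[k] += coeff * dk
        w = [wk + lr * gk / len(diffs) for wk, gk in zip(w, grad)]
    return w

random.seed(1)
docs = [[random.random(), random.random()] for _ in range(40)]
hidden_w = [2.0, -1.0]  # hypothetical "true" ranker; the fitter never sees it
ranked = sorted(docs, key=lambda x: -(hidden_w[0] * x[0] + hidden_w[1] * x[1]))

diffs = pairwise_differences(ranked)
w = fit_pairwise_logistic(diffs)
accuracy = sum(1 for d in diffs if sum(wk * dk for wk, dk in zip(w, d)) > 0) / len(diffs)
print("learned direction:", w, "pairwise accuracy:", accuracy)
```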

A lot of these factors depend on internal yandex data, serp clicks especially - good luck getting those outside of yandex prod, and those are the strongest signals.

All you need to do is scrape the whole internet beforehand.

It's not. In addition to what ad404b8a372f2b9 said in a sibling comment:

- nobody in the West cares about Yandex SERPs and Yandex isn't Google, so you can't easily transfer the factors

- Your favorite SERPs aren't being overtaken by SEO garbage because they're not big/profitable enough to catch attention. It's really more of a dark forest thing in SEO, and search volume is how you detect prey. SEOs don't care about things they cannot make a lot of money from.

- SEOs aren't that technical. They employ some technical people (me for instance), but they don't listen to technical people, they care about links links links links, headlines, title, description, microdata (for fancier Google SERP display because having a fancy display can make rank 2 work just as well as rank 1) and keywords. Some do keyword-related stuff like WDF*IDF, but it's more of a ritual where they throw text into different tools and wait for all lamps to turn green. They're really not that sophisticated. Source: am working for large affiliates with millions of visits per month.

I'm curious: how much of modern SEO is low-level trickery, and how much is deeper marketing strategy ?

From my understanding, they're somewhat distinct. There are blackhats that try to exploit some quirk and shoot for quick & good rankings that get killed when Google notices or patches a bug etc. The people I work for are mostly light-grey-hats and don't usually do this because they run large sites they don't want to put at risk and they're very conservative about changing anything and doing anything that could be remotely viewed as malicious by Google (even when it's obviously not malicious and would be good for users, there's a lot of "yeah, I agree, but I did/hear about/saw something somewhat similar 10 years ago and Google killed the site and it never recovered" in SEO).

Everyone in SEO treats Google like a god. If you have a somewhat stable + successful project, you're sacrificing things to Google (let's all adopt AMP, yes, it's so great! let's all do CWV and say we believe in user experience!) and praise Google each day and avoid anything that you've heard might displease Google for fear of Google sending down lightning and turning your projects to ashes. If you don't have one of those golden geese, you must be more nimble and make sure that the lightning only strikes where you've been yesterday and doesn't catch up with you.

The blackhat part is much more stressful (I've had clients get super depressed when their old tricks no longer worked and they burned site after site after site and nothing worked for months), and everyone I know that did that has transitioned to whitehat as soon as an opportunity presented itself and gotten rid of all the blackhat stuff to not have their main sites get caught in a penalty.

Given that, is there anything someone without an expiring domain can do to rank besides attempting to get backlinks?

I'm not terribly deep into how to rank, I mostly just build features for sites, tools for analysis etc, but from what I learn while working on those: be a trusted expert in your niche and publish stuff that the SEO editors don't understand or don't have time to learn. And then make sure you get a bunch of links.

I have no idea how ML generated content will change all that, but Expertise, Authority and Trust (=EAT) is what everybody has been worried about for the past year, and whether Google will believe them that they're experts and should be trusted.

>They employ some technical people (me for instance), but they don't listen to technical people

what are some suggestions you make that your clients don't follow?

Not so much suggestions, it's more that they have a lot of cultish beliefs from ancient times. E.g. it takes a lot of reassurance to make them understand that Google won't consider using a CDN as negative because you'd share the IP with other sites and some of those might be bad, because it would affect most of the internet.

They vastly over-estimate Google's abilities to identify patterns ("we use similar wording on other sites, that's a pattern" for wording that everyone uses, like talking about the beach distance on a hotel review, or just using bootstrap) and are like "I don't understand, but I trust you so I'll consider this okay ... for now". If some ranking drops, they're always eager to roll back any changes that were done, even if they didn't affect the frontend ("who knows how Google works").

At the same time, while they want to avoid patterns to connect their sites, they run all their sites in the same search console and analytics account.

Most of them pick up very few technical things and transition from Product Management into SEO by learning how to run reports and what blogs to read for instructions. Introducing the DOM inspector is something I'm often blowing minds with, same with viewing requests in the network console ("you mean I don't need the redirect tester 3000 any more?")

I've only once worked for clients that had someone doing SEO who was deep into technical SEO and had all kinds of automation set up and roughly understood how HTTP works, the differences between server and client etc.

Much of it is quasi-religious. I've often jokingly suggested trying to sacrifice a chicken, and I'm not sure everybody just laughed without giving it some serious thought.

Any general recommendations for SEO blogs?

I don't know, I don't read any :)

I get involved when the SEOs have decided what to do, but I don't keep up with trends and developments.

It is reasonable to assume there is some cross-contamination in search engine development between the two, and therefore yandex's factors might be useful for google as well.

Probably, but I don't think there's really any secret sauce they're sharing. Many of the factors are things every developer would come up with if they sat down and asked themselves "what would I do if I wanted to rank these results?".

Of course, the "how exactly is it implemented", "what weight does this factor have", "can I exploit how it works" etc is another story, but if you started building a search engine and wrote down your ideas about ranking and approaches to deal with some manipulation attempt etc, I'm sure you'd come up with a lot of the same things.

not really, 90% of SEO is backlinks/site authority. You can post the same article to a new site and not rank at all, or post it to a stronger old domain and have it rank #1.

Google got a lot of flak over the fake news stuff so they've gotten even more conservative with what they rank in recent years. I think that's a big reason people have noticed results getting worse, there is a lot of really great niche stuff on small blogs but Google would rather rank more generic content from an established website. A few major sites like CNET have already gotten caught churning out AI produced stuff and Google still ranks it due to site authority

i always wonder why google doesn't offer a downvote button for obviously bad links. the only reason i can think of is vote bombing, but if you contain that for personalization and do basic bomb detection you could greatly improve the experience for people

A link that is bad for you might serve a lot of ads for a lot of people, which is good for company that has most of its profits from commissions on that process (Google).

Here's the shortest explanation of why you'll never get anything better.

Google already wants a working phone number for the majority of users, so it could make voting logged-in only and leave it to the community to decide. But that would conflict with its "paid" programs, so I guess it will never happen.

SERP = search engine result page

Apparently. Had heard of SEO, but not SERP

From the article, the interesting factors seem to be things like number of unique visitors and percentage of organic traffic.

The latter is interesting as a factor because it suggests that visitors are visiting for the value rather than because they've been targeted by ads, implying higher value.

Not surprised that some websites have an artificial preference though. It feels like this is basically the selling point (other than privacy) of Kagi, DDG with its bang commands, and other alternative search engines – artificial inflation of rank.

I find it interesting that even though many people click on a google link to some page (without checking the url), see the page, click back in two seconds, and choose a different result, all that doesn't downgrade that page's page rank (if a user (many of them) comes back to search results in two seconds, the content was obviously garbage).

Some pages, that I won't name, but one of them starts with a pin- and ends with -terest consistently get high google results while offering garbage content.

> ...comes back to search results in two seconds, the content was obviously garbage

I don't think that if a user stays a short time necessarily means the content is garbage. It might also mean that the user found directly what he was looking for, such as the answer to the question.

It might even mean the page is very good as it is clear and provides the answer and does not have other click-bait type content.

Landing on a bad page, might require you to read and browse around and maybe after a while figure out it's garbage or you get distracted by some "interesting", but unrelated articles. But if the search engine would rank it high because of this then that would be wrong.

If a user clicks through and doesn't end up back at the search engine, there is probably a high probability they found the answer, whether they spent a short or long time at the page. But if they click through and then two seconds later come back to the search engine results, that seems unlikely.

I tend to scroll down the first few pages of results and open anything that appears interesting even if it's not really relevant to what I wanted. Hopefully that doesn't look similar enough that it might impact the ratings of sites on other search engines. It's possible that the first few results got me to what I was looking for (that's far from a given these days though) but I'm clicking on a lot of other results in a short amount of time regardless.

This would be true, if a user then did another search for something else, or close the tab. If you got an answer, you usually don't come back and click the next link in the results.

For me the pattern changed in the past few years. Before, I would take the first good answer as the answer, they used to be of good quality. Now, I tend to look at a couple (up to 4) of results to kind of cross-validate the answer before using it. Sometimes it takes me quite some "click through the results pages" to find 4 diverse enough results to validate the answer.

But looking at my kids, looking directly at the info box and considering the answer as given by God, I suppose I am an exception.

Well, why would it downgrade pagerank which is based on web graph connectivity? You're not changing the graph in any way with your click, no?

But Google definitely tracks such short clicks and considers them a bad signal, very important in ranking. Same for Yandex, search for "dwell time" (visit duration) in its code/factors.
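As a rough illustration of what such a signal could look like, here is a sketch that computes a per-URL "short click" share from a click log. The log format, the 5-second threshold, and the function name are all hypothetical; this is not how Google or Yandex actually compute it. It only counts a visit as a short click when the user both left quickly and came back to the results page, which is the distinction made elsewhere in this thread.

```python
from collections import defaultdict

SHORT_CLICK_SECONDS = 5  # hypothetical threshold, not taken from any real system

def short_click_share(click_log):
    """click_log: iterable of (url, dwell_seconds, returned_to_serp) tuples.

    Returns {url: fraction of clicks that were quick bounces back to the results}.
    """
    totals = defaultdict(int)
    shorts = defaultdict(int)
    for url, dwell_seconds, returned_to_serp in click_log:
        totals[url] += 1
        # A fast visit that does NOT return to the results may simply mean
        # the user found the answer, so it is not counted against the page.
        if returned_to_serp and dwell_seconds < SHORT_CLICK_SECONDS:
            shorts[url] += 1
    return {url: shorts[url] / totals[url] for url in totals}

example_log = [
    ("good.example", 3, False),   # quick visit, never came back: fine
    ("good.example", 120, False),
    ("spam.example", 2, True),    # bounced straight back to the results
    ("spam.example", 1, True),
    ("spam.example", 40, True),
]
shares = short_click_share(example_log)
print(shares)
```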

I often open a bunch of the results in new tabs. Which could be interpreted as I'm unhappy with the first ones I open, but I haven't even looked at them at that point.

Well sure, but i'm sure google can detect if you clicked 5 results in five tabs, or if you've clicked on one, and clicked "back" immediately... if a huge amount of users stay only 2-3 seconds on a specific page and then come back, that would be a red flag for me.

On the other hand, maybe some people actually like the pinterest pages in results, and i'm the weird one... who knows.

> On the other hand, maybe some people actually like the pinterest pages in results, and i'm the weird one... who knows.

It isn't just you! I suspect that for certain people it's easy for them to get distracted there though.

Some people believe that it does, that's why they pay users to search for something and click through the serps & go back until they arrive at the target domain where they don't click back but stop the experiment, signalling that this page answered their question / solved their problem.

I've seen some data on this, and that data was inconclusive. Sometimes it seems to work, sometimes it seems to be counter-productive, sometimes nothing happens. It's hard to truly test things in isolation, and maybe their methods were shit, but I don't think it really does anything, or maybe Google has solid ways to detect anomalies and ignore them.

It's possible that is a factor, but that some sites are just very good at boosting the other metrics, or that they provide some other kind of perceived value and are therefore up-ranked.

Do Google et al really not check for that?

Of course they do.

In fact, a good way to downrank a competitor's pages is to do a search for them, click the result, then click back 5s later. Repeat 1000x from different google accounts. A week later, the competitor will go off the first page.

Did you try that in practice?

They absolutely do

Yep, would be interesting to test if the "percentage of organic"-factor also is relevant in Google. This would mean that they penalize pages for buying their own product.

Could someone explain to me how you go from these ranking facts to an actual sorted list of results? and doing it with an acceptable latency. Also, what happened to Google page rank, is it still relevant today?

You throw them into a machine learning model together with a big dataset of queries/urls annotated by humans for relevancy. Catboost is yandex's choice of model here.

> and doing it with an acceptable latency

Lots of interesting optimizations possible here, but the big obvious one is multiple level models: score documents with a cheap model (FastRank in yandex lingo) first using a subset of the fastest available features, then rescore top docs with your best slow expensive model. Perhaps rescore multiple times at different points in the stack with models of varying complexity, at each index shard and after aggregating the results from subset/all shards. Also sort documents in each index shard by some other ML model with query-independent features to push all the junk to the end of the index where you'd likely skip it when running out of time budget to process a query.
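The two-level idea can be sketched in a few lines. This is a generic cascade, not Yandex's actual code: the function names, the document fields, and the shortlist sizes are made up for illustration. The point is that the expensive scorer only ever runs on the shortlist.

```python
def cascade_rank(docs, cheap_score, expensive_score, shortlist_size=100, top_k=10):
    """Stage 1: rank everything with the cheap scorer and keep a shortlist.
    Stage 2: rerank only the shortlist with the expensive scorer."""
    shortlist = sorted(docs, key=cheap_score, reverse=True)[:shortlist_size]
    return sorted(shortlist, key=expensive_score, reverse=True)[:top_k]

# Synthetic documents with a hypothetical query-independent "static_rank"
# and a hypothetical slow-to-compute "relevance".
docs = [{"id": i, "static_rank": i % 50, "relevance": (i * 37) % 101} for i in range(1000)]
calls = {"expensive": 0}

def cheap(doc):
    return doc["static_rank"]

def expensive(doc):
    calls["expensive"] += 1  # count how often the slow model is invoked
    return doc["relevance"]

top = cascade_rank(docs, cheap, expensive, shortlist_size=100, top_k=10)
print([d["id"] for d in top], "expensive calls:", calls["expensive"])
```

The trade-off is visible here too: a document that the cheap scorer pushes out of the shortlist can never be rescued by the expensive one, however relevant it is.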

> Also, what happened to Google page rank, is it still relevant today?

Vanilla 1990s' pagerank obviously not, but the idea of such graph-based calculations is still very useful yes.

> Vanilla 1990s' pagerank obviously not,

what did we learn about the flaws?

It's old and too simple for today's web, everyone knows it and everyone games it. But the idea behind it is still useful, just need more tricks, more ML etc.
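For reference, the vanilla algorithm under discussion fits in a few lines of power iteration. This is a textbook-style sketch with the commonly quoted damping value of 0.85 and an even redistribution for pages without outgoing links; the tiny link graph is invented, and real engines use far more elaborate variants.

```python
def pagerank(links, damping=0.85, iterations=100):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets the "random jump" share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Toy graph: a, b and d all link to c; c links back to a.
graph = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(graph)
print(sorted(rank.items(), key=lambda kv: -kv[1]))
```

The gaming problem is easy to see even in the toy graph: adding more throwaway pages like b and d that link to c pushes c's score up.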

For an exhaustive answer, you can check the Yandex source code that was posted in another thread on HN


Pages are sorted by page rank, you pick first 10-100k pages and sort using other factors, then take first 10-100 results and sort using most costly factors.

"SEOs have already started analyzing Yandex's search ranking factors, which include PageRank and several other link-related factors"

PageRank is factor 0.

All of the listed "facts" or "features" can be incorporated into an index (e.g. with Lucene) or fed into ML-models (logistic regression, trees, neural nets, etc).

Without going into specifics, I've seen ranking treated as a multi-stage algorithmic problem. Initially, you rank results with a lower-quality but low-latency ranker at the first stage (inverted index, tf-idf, knn, etc), and subsequent stages rerank the top-K results with higher-latency ML models outputting a relevance score.

I believe progress is currently being made to combine everything into one giant neural model that just ranks everything from the get-go rather than passing through multiple stages.

I can't answer, but you should be able to find the answer on https://yacy.net -- free web search. By the way, it can probably tremendously benefit from the leak.

It's interesting that they consider the presence of ads and the presence of Yandex ads as separate factors. I wonder if this implies a priority for pages with Yandex ads.

at least we can be sure google would never consider such a ranking

That would explain why some content farms rank above e.g. Stack Overflow and Wikipedia with content scraped from those sites.

Really? /S it’s the only reason why I have Adsense.

Maybe I'm just a newbie when it comes to building search algorithms, but how on earth would you maintain or test something like this? From experience, I know that search algorithms using far fewer ranking factors than this are deployed in production and treated as terrifying black boxes that no one wants to touch lest they break something in an unexpected way.

Testing is easy with any half decent development setup. You should have some train/eval datasets and monitor metrics on them during training, this is ML 101. And do live A/B experiments for launch candidates.
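As an example of the kind of metric one might monitor on an eval set, here is a small NDCG sketch. NDCG is a widely used ranking metric, but whether Yandex tracks this particular one is not stated in the thread; the linear-gain variant and the relevance grades below are assumptions for illustration.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with linear gain and a log2 position discount."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))

def ndcg(relevances_in_ranked_order, k=10):
    """relevances_in_ranked_order: human relevance grades, in the order the
    model ranked the documents for one query."""
    ideal = sorted(relevances_in_ranked_order, reverse=True)[:k]
    best = dcg(ideal)
    return dcg(relevances_in_ranked_order[:k]) / best if best > 0 else 0.0

# Hypothetical grades (3 = perfect, 0 = irrelevant) for one eval query.
perfect_order = ndcg([3, 2, 1, 0])
reversed_order = ndcg([0, 1, 2, 3])
print(perfect_order, reversed_order)
```

Averaging this over all eval queries, before and after a model change, gives a single number to watch during training and to compare launch candidates before they go to a live A/B experiment.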

Maintenance sure is hell, lots of sweat and tears. Just glancing over search/formula/webcommon/select_ranking_models.cpp makes me cringe, they must have many dozens if not hundreds of different models in prod by now. Each of them needing maintenance and lots of training data. Work on new ranking factors I suspect must be also highly frustrating: throwing stuff at wall^W catboost black box and seeing if it sticks, and if it doesn't you'd have little idea why and control over it. Imho google's approach (white-boxish interpretable top level ranking formulas) is far superior and maintainable at scale.

95 - Ukrainian

> It is equal to one if the site has a Ukrainian geoist (i.e. 1 - Ukrainian site)

This is completely unrelated to recent events. Before the initial invasion in 2014, many Russian IT companies had a huge presence in Ukraine, and many Yandex services, including search, had localized versions with their own search ranking, etc. Yandex still has versions for Belarus, Kazakhstan, other ex-USSR countries and even Turkey. But the chances for international expansion were pretty much destroyed by the political situation.

Also keep in mind Yandex was a different company a decade ago. Back then Ilya Segalovich was CTO there (he died of cancer in 2013), and he supported the opposition and even participated in street protests in Moscow.
