Hacker News

Apparently this leak is a huge boon to SEO. Expect more SERPs to become SEO-overtaken garbage dumps in the near future. Truly depressing.



I'm doubtful this will have much of an effect for three reasons:

- These are factors, not the underlying algorithms which use them, which means it's not possible to deduce your own ranking from that data.

- Many of those are so vague as to be useless.

- The most important factors for SEO are those we already know about.


It's possible to use an algorithmic approach (e.g., linear regression) to derive the weights for each of these factors.
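As a sketch of what that could look like: assuming (a big assumption) that the final score really is a linear combination of the leaked factors, ordinary least squares on observed (factors, score) pairs would recover the weights. The factor names and data below are purely illustrative.

```python
# Hypothetical sketch: if the relevance score were a plain linear combination
# of leaked factors, least squares could recover the weights from observed
# (factors, score) pairs. All names and numbers here are made up.

def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination (small dense systems only)."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: pick the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_weights(feature_rows, scores):
    """Least squares via the normal equations: (X^T X) w = X^T y."""
    k = len(feature_rows[0])
    xtx = [[sum(row[i] * row[j] for row in feature_rows) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * s for row, s in zip(feature_rows, scores))
           for i in range(k)]
    return solve_linear_system(xtx, xty)

# Toy data: two factors (say, link score and text relevance) combined with
# true weights 2.0 and 0.5.
rows = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [0.5, 5.0]]
ys = [2 * a + 0.5 * b for a, b in rows]
print([round(w, 3) for w in fit_weights(rows, ys)])  # → [2.0, 0.5]
```

With noiseless linear data the true weights come back exactly; with real SERP observations you'd only ever see relative rankings and noisy features, which is where this approach starts to break down.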


This is assuming the weights are combined via some linear function. That's not necessarily the case; the ranking algorithm is likely more complex than "calculate features, multiply by weights, sum to get a ranking." Sure, you'll probably still be able to learn something by playing with the different features, but I doubt you'll get any real, meaningful "weights", especially when how many of the features are calculated is itself a black box.


Why would it be more complicated? A simple dot product would have excellent scaling characteristics.


I designed the ranking algorithm for our product search. There are several factors in it that are nonlinear.

For example, we obviously want the most-viewed and most-bought products to rise to the top. However, we also expect there's an initial honeymoon period for many new products, where people want to see them but they don't have enough history yet to sway the popularity factor. So there's a non-linear term that looks kind of like

  ( weight_factor / (product_age_days * age_scale_constant) )
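A rough sketch of how such a nonlinear boost might be mixed into an otherwise linear score (the constants are illustrative, and the `+ 1` is just to avoid dividing by zero for a day-zero product):

```python
# Illustrative sketch: linear popularity factors plus a nonlinear
# "new product" boost that decays as the product ages. All weights and
# constants are made-up values, not the parent commenter's real ones.

WEIGHT_FACTOR = 10.0
AGE_SCALE_CONSTANT = 0.5

def score(views, purchases, product_age_days):
    linear_part = 0.1 * views + 2.0 * purchases
    # Honeymoon boost: large for brand-new products, fades with age.
    honeymoon_boost = WEIGHT_FACTOR / ((product_age_days + 1) * AGE_SCALE_CONSTANT)
    return linear_part + honeymoon_boost

# A brand-new product with no history can outrank a two-month-old product
# with only modest traffic...
print(score(views=0, purchases=0, product_age_days=0))    # → 20.0
print(score(views=50, purchases=0, product_age_days=60))  # ~5.33

# ...but a genuinely popular old product still wins comfortably.
print(score(views=500, purchases=20, product_age_days=60))  # ~90.33
```

The point is that the boost's contribution depends on age in a way no single fixed weight can express, which is the nonlinearity being discussed.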


At a glance it seems like there would be a way to fold the weighting adjustment into a subquery and multiply by the reciprocal. And then it's still multiply-accumulate, but maybe I'm missing something.


Isn't that just saying "delegate the nonlinear part out to a black box"?


At a glance (1 / (product_age_days * age_scale_constant)) just seems like another factor you can multiply (with the calculation of reciprocal costing more in compute time). Again, I'm more than likely misunderstanding you or lacking the context.


When it comes to ML, scalability is a constraint, not a goal. The goal is to minimize some loss function, and it turns out a simple dot product can be outperformed by more complex algorithms.

I remember reading a few years ago that most search engines use some tree-based model. If that's the case, the idea of monotonic linear weights isn't relevant.
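For context on why tree-based rankers have no per-factor "weight": a gradient-boosted ensemble (LambdaMART is the classic learning-to-rank example) scores a document by walking each tree on its feature values and summing the leaves, so a feature's contribution depends on which leaf the document lands in. A toy illustration with made-up thresholds and leaf values:

```python
# Toy illustration of tree-ensemble scoring. The splits and leaf values are
# invented; real models have hundreds of learned trees.

def tree_1(features):
    # Splits on link score first, then freshness.
    if features["link_score"] > 0.6:
        return 2.0 if features["freshness_days"] < 30 else 1.2
    return 0.5

def tree_2(features):
    # Reuses link_score with a different threshold and different leaf values:
    # there is no single "link_score weight" across the model.
    if features["text_match"] > 0.5:
        return 1.5
    return 0.75 if features["link_score"] > 0.8 else 0.25

def ensemble_score(features):
    # A gradient-boosted model just sums the trees' outputs.
    return tree_1(features) + tree_2(features)

fresh_popular = {"link_score": 0.9, "freshness_days": 5, "text_match": 0.7}
stale_obscure = {"link_score": 0.2, "freshness_days": 400, "text_match": 0.4}
print(ensemble_score(fresh_popular))  # → 3.5
print(ensemble_score(stale_obscure))  # → 0.75
```

Knowing the feature list alone tells you nothing about the thresholds or leaf values, which is why the leaked factors don't translate into reproducible rankings.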


Can you be more specific? Dot product is about as performant as it gets with linear memory access and SIMD multiply accumulate. Throw random memory access and flow control in there and it’s a struggle to do it faster. Unless the factors are sparse, in which case just elide the zero values.
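The sparse case mentioned at the end is simple to sketch: store only the nonzero factor values and skip the multiply-adds for everything else. A minimal illustration (feature layout is hypothetical):

```python
# Sketch: dense vs. sparse scoring. With mostly-zero factor vectors, storing
# only the nonzero entries elides the wasted multiply-accumulates.

def dense_dot(factors, weights):
    return sum(f * w for f, w in zip(factors, weights))

def sparse_dot(sparse_factors, weights):
    # sparse_factors maps factor index -> nonzero value.
    return sum(v * weights[i] for i, v in sparse_factors.items())

weights = [0.5, 1.0, 0.0, 2.0, 0.25]
dense = [0.0, 3.0, 0.0, 0.0, 4.0]
sparse = {1: 3.0, 4: 4.0}
print(dense_dot(dense, weights), sparse_dot(sparse, weights))  # → 4.0 4.0
```

Both give the same score; the sparse form just touches two entries instead of five, at the cost of indexed (and thus less SIMD-friendly) memory access.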


> scalability is a constraint not a goal. The goal is to minimize some loss function


My bad. I was under the impression that most search engines are compute bound, but if anything there’s probably a glut of compute for such applications and a market appetite for better results.


Also, I'd assume it's highly time-agnostic (i.e. content change timespan : compute availability timespan).

So you can run your bulk-recomputing whenever you have spare capacity.

Stale rankings aren't great, but don't hurt that much. As long as your liveness is more frequently updated, so you don't send people to dead sites.


Certainly caching is important, especially for Word2Vec or other NLP which you'd want to happen in a separate stage after crawl, but as someone mentioned in a sibling comment, there are some factors that are calculated per-query, which can have a lot of cache misses for novel queries.


If so, I'd highly suspect Google varies the compute/cache permitted for novel queries.

By this point, I can't imagine they haven't automatically balanced {revenueFromQuery} to {costOfQuery}.

No sense delivering hyperoptimized results if you lose money on generating them.


I’d suspect you’re right


Given some samples of search results we still don't know:

- The X matrix (e.g. the page rank score) for each result.

- The y vector, i.e. the score for each result. Although we can observe the relative ranking in each sample (I'd be interested to hear about techniques to cope with this).
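One standard family of techniques for the relative-ranking problem is pairwise learning-to-rank (RankSVM/RankNet style): turn each observed "a ranked above b" into a training example on the feature difference, then fit a binary classifier. A minimal sketch with a toy perceptron and invented features:

```python
# Sketch of pairwise learning-to-rank: learn a scoring weight vector from
# relative orderings only, never from absolute scores. The perceptron and
# the toy SERP data are illustrative, not any engine's real method.

def pairwise_examples(ranked_feature_lists):
    """Yield (feature_diff, label) pairs from observed result orderings."""
    for results in ranked_feature_lists:          # each list is best-first
        for i in range(len(results)):
            for j in range(i + 1, len(results)):
                diff = [a - b for a, b in zip(results[i], results[j])]
                yield diff, 1                     # higher minus lower: positive
                yield [-d for d in diff], -1      # and the mirrored example

def fit_pairwise(ranked_feature_lists, epochs=50, lr=0.1):
    """Perceptron on pairwise differences; returns a weight vector whose
    dot product reproduces the observed ordering (if it's separable)."""
    k = len(ranked_feature_lists[0][0])
    w = [0.0] * k
    for _ in range(epochs):
        for diff, label in pairwise_examples(ranked_feature_lists):
            margin = sum(wi * di for wi, di in zip(w, diff))
            if label * margin <= 0:               # misranked pair: update
                w = [wi + lr * label * di for wi, di in zip(w, diff)]
    return w

# One observed SERP: three results listed best-first, two features each.
serp = [[(3.0, 1.0), (2.0, 2.0), (1.0, 0.5)]]
w = fit_pairwise(serp)
scores = [sum(wi * fi for wi, fi in zip(w, f)) for f in serp[0]]
print(scores == sorted(scores, reverse=True))  # → True
```

This sidesteps the missing y vector entirely: you only ever train on which of two results ranked higher, which is exactly what scraped SERPs give you.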


A lot of these factors depend on internal Yandex data, SERP clicks especially; good luck getting those outside of Yandex prod, and those are the strongest signals.


All you need to do is scrape the whole internet beforehand.


It's not. In addition to what ad404b8a372f2b9 said in a sibling comment:

- Nobody in the West cares about Yandex SERPs, and Yandex isn't Google, so you can't easily transfer the factors.

- Your favorite SERPs aren't being overtaken by SEO garbage because they're not big/profitable enough to catch attention. It's really more of a dark forest thing in SEO, and search volume is how you detect prey. SEOs don't care about things they cannot make a lot of money from.

- SEOs aren't that technical. They employ some technical people (me, for instance), but they don't listen to technical people. They care about links, links, links, headlines, titles, descriptions, microdata (for fancier Google SERP display, because a fancy display can make rank 2 work just as well as rank 1) and keywords. Some do keyword-related stuff like WDF*IDF, but it's more of a ritual where they throw text into different tools and wait for all the lamps to turn green. They're really not that sophisticated. Source: I work for large affiliates with millions of visits per month.


I'm curious: how much of modern SEO is low-level trickery, and how much is deeper marketing strategy?


From my understanding, they're somewhat distinct. There are blackhats that try to exploit some quirk and shoot for quick & good rankings that get killed when Google notices or patches a bug etc. The people I work for are mostly light-grey-hats and don't usually do this because they run large sites they don't want to put at risk and they're very conservative about changing anything and doing anything that could be remotely viewed as malicious by Google (even when it's obviously not malicious and would be good for users, there's a lot of "yeah, I agree, but I did/hear about/saw something somewhat similar 10 years ago and Google killed the site and it never recovered" in SEO).

Everyone in SEO treats Google like a god. If you have a somewhat stable + successful project, you're sacrificing things to Google (let's all adopt AMP, yes, it's so great! let's all do CWV and say we believe in user experience!) and praise Google each day and avoid anything that you've heard might displease Google for fear of Google sending down lightning and turning your projects to ashes. If you don't have one of those golden geese, you must be more nimble and make sure that the lightning only strikes where you've been yesterday and doesn't catch up with you.

The blackhat part is much more stressful (I've had clients get super depressed when their old tricks no longer worked and they burned site after site after site and nothing worked for months), and everyone I know that did that has transitioned to whitehat as soon as an opportunity presented itself and gotten rid of all the blackhat stuff to not have their main sites get caught in a penalty.


Given that, is there anything someone without an expiring domain can do to rank, besides trying to get backlinks?


I'm not terribly deep into how to rank, I mostly just build features for sites, tools for analysis etc, but from what I learn while working on those: be a trusted expert in your niche and publish stuff that the SEO editors don't understand or don't have time to learn. And then make sure you get a bunch of links.

I have no idea how ML-generated content will change all that, but Expertise, Authoritativeness and Trustworthiness (E-A-T) is what everybody has been worried about for the past year, and whether Google will believe them that they're experts and should be trusted.


>They employ some technical people (me for instance), but they don't listen to technical people

what are some suggestions you make that your clients don't follow?


Not so much suggestions; it's more that they have a lot of cultish beliefs from ancient times. E.g. it takes a lot of reassurance to make them understand that Google won't count using a CDN against you just because you'd share an IP with other sites, some of which might be bad: penalizing that would affect most of the internet.

They vastly over-estimate Google's abilities to identify patterns ("we use similar wording on other sites, that's a pattern" for wording that everyone uses, like talking about the beach distance on a hotel review, or just using bootstrap) and are like "I don't understand, but I trust you so I'll consider this okay ... for now". If some ranking drops, they're always eager to roll back any changes that were done, even if they didn't affect the frontend ("who knows how Google works").

At the same time, while they want to avoid patterns to connect their sites, they run all their sites in the same search console and analytics account.

Most of them pick up very few technical things and transition from Product Management into SEO by learning how to run reports and which blogs to read for instructions. Introducing the DOM inspector is something I'm often blowing minds with, same with viewing requests in the network console ("you mean I don't need the redirect tester 3000 any more?").

I've only once worked for clients that had someone doing SEO who was deep into technical SEO and had all kinds of automation set up and roughly understood how HTTP works, the differences between server and client etc.

Much of it is quasi-religious; I've often jokingly suggested sacrificing a chicken. I'm not sure everybody laughed it off without giving it some serious thought.


Any general recommendations for SEO blogs?


I don't know, I don't read any :)

I get involved when the SEOs have decided what to do, but I don't keep up with trends and developments.


It is reasonable to assume there is some cross-pollination in search engine development between the two, and therefore Yandex's factors might be useful for Google as well.


Probably, but I don't think there's really any secret sauce they're sharing. Many of the factors are things every developer would come up with if they sat down and asked themselves "what would I do if I wanted to rank these results?".

Of course, the "how exactly is it implemented", "what weight does this factor have", "can I exploit how it works" etc is another story, but if you started building a search engine and wrote down your ideas about ranking and approaches to deal with some manipulation attempt etc, I'm sure you'd come up with a lot of the same things.


Not really; 90% of SEO is backlinks/site authority. You can post the same article to a new site and not rank at all, or post it to a stronger, older domain and have it rank #1.

Google got a lot of flak over the fake news stuff, so they've gotten even more conservative with what they rank in recent years. I think that's a big reason people have noticed results getting worse: there is a lot of really great niche stuff on small blogs, but Google would rather rank more generic content from an established website. A few major sites like CNET have already gotten caught churning out AI-produced stuff, and Google still ranks it due to site authority.


I always wonder why Google doesn't offer a downvote button for obviously bad links. The only reason I can think of is vote bombing, but if you contained that via personalization and did basic bomb detection, you could greatly improve the experience for people.


A link that is bad for you might serve a lot of ads to a lot of people, which is good for a company that makes most of its profits from commissions on that process (Google).

Here's the shortest explanation of why you'll never get anything better.


Google already has a working phone number for the majority of users, so it could make voting logged-in only and leave it to the community to decide. But that would conflict with its "paid" programs, so I guess it will never happen.


SERP = search engine result page

Apparently. I'd heard of SEO, but not SERP.



