Day one of "writing a reasonable search engine" would be to kill sitespam, no? Any time you see N sites with the same or very similar content, you should immediately drop the rank of all but a small slice at the top. You could keep the one with the most trusted incoming links (regular PageRank), or you could just trust the oldest one in your index!
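To make the "trust the oldest one" idea concrete, here is a minimal sketch, assuming near-duplicates are detected with word shingles and Jaccard similarity (my choice of technique, nobody here claims a real search engine does it this way). The `Page` record, its `first_indexed` field, and the 0.8 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    text: str
    first_indexed: int  # hypothetical: day number when the crawler first saw the page


def shingles(text, k=3):
    """Return the set of k-word shingles of a text, lowercased."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def split_originals_and_clones(pages, threshold=0.8):
    """Walk pages oldest first. A page that is at least `threshold`
    similar to an already kept (older) page is treated as a clone."""
    kept = []      # list of (page, shingle set) pairs
    clones = []
    for page in sorted(pages, key=lambda p: p.first_indexed):
        page_shingles = shingles(page.text)
        if any(jaccard(page_shingles, s) >= threshold for _, s in kept):
            clones.append(page)
        else:
            kept.append((page, page_shingles))
    return [p for p, _ in kept], clones
```

This compares every page against every kept page, so it only illustrates the ranking rule. At index scale you would need something like MinHash or SimHash to find candidates cheaply, and the NLP-rewritten clones mentioned elsewhere in the thread would need a fuzzier similarity than exact shingles.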
For example: there is now an epidemic of StackOverflow clone sites. They just post the SO answers with their own ads. But I don't want that site. So how on earth can Google show the clone sites on top of the true StackOverflow?
You'd think they have systems in place with hundreds of thousands of canary queries and "known expected rankings" such that IF one of the fraud sites manages to trick their system, they can just swiftly patch it to restore order and bury the clone sites after page 100. But no.
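A canary check like that could be as simple as the sketch below. It assumes a hypothetical `search` callable that returns result domains in ranked order; the `Canary` record and the clone domain names are made up for illustration, and I have no idea what Google actually runs internally.

```python
from typing import NamedTuple


class Canary(NamedTuple):
    query: str
    expected: str       # domain that should rank first among the look-alikes
    clones: frozenset   # known clone domains for this query


def failing_canaries(search, canaries, top_n=10):
    """Return (query, reason) pairs for canaries whose expected domain is
    missing from the top results or is outranked by a known clone."""
    failures = []
    for canary in canaries:
        ranked = search(canary.query)[:top_n]
        if canary.expected not in ranked:
            failures.append((canary.query, "expected domain missing"))
            continue
        position = ranked.index(canary.expected)
        outranking = [d for d in ranked[:position] if d in canary.clones]
        if outranking:
            failures.append((canary.query, "outranked by " + ", ".join(outranking)))
    return failures
```

Run it after every ranking change: an empty list means the known-good orderings still hold, and anything else tells you which queries a clone has climbed on.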
Here I took a random SO post and searched for a sentence from one of its answers. This result is in the top 5 results (in this case the real SO result was above it, but it's not rare for the impostor sites to rank higher).
I think it depends on location, browser config and other things. As another comment said, I generally get SO pages at the top, but pages with the same title appear next. I thought they were different forums with different answers, but they were just SO clones.
I just did a search for a Python problem and got a Stack Overflow clone on the front page. Across a few different searches, most of them turned up a clone. They weren't at the top, usually somewhere after 5th place. I've also seen sites that have just copied GitHub issue posts.
Not OP, but you can probably replicate this quite well by copying any sentence from an SO question and pasting it into Google. You will find duplicates. If not, try to find a more unique sentence in that question/answer, especially one with an unusual turn of phrase. In my example below, the tell was that legitimate NLP re-writes of the question didn't include "and do what ever you want", so including it found lots of clones.
They are getting smarter now and using NLP to mix up the content, but it's very much the same question or answers, or just the answers, or some variation of the two, or made to look like a forum with SO comments as forum replies, etc. Most of it is nonsensical if you try to understand it.
That was just what I could glean from the previews, and all of it was on the first page of Google hits.
I 100% guarantee that if you whacked 95% of the above domains and banned whoever registered them from the web forever, you'd make the web a better place.
Googling "What you can do is set the FormBorderStyle property to None and do what ever you want with the form using GDI" gives me only 5 clones on the first page and 4 of them are blocked by uBlacklist. The stackoverflow result is above the clones.