Hacker News
Bing Jail (wieckiewicz.org)
187 points by idarek on April 23, 2023 | 56 comments



I don't know what happened this year specifically, but DDG results in general have abysmally tanked in quality. A majority of the first 10-20 results are entirely off topic. It's become impossible to do specialized searches. Negative queries no longer filter all unrelated results.

Lately I've been using phind, but for general knowledge or ambiguous problems I'm afraid I'm being forced to go back to Google for searches...


I want DDG to succeed, but I'm worried they missed their window of opportunity. Being a thin wrapper on Bing will only get you so far... I was hoping they were busy building their own search engine the past several years, but it doesn't seem they have been.

People noticed their traffic was trending downward last year, and they responded by removing their publicly facing traffic page entirely (formerly here: https://duckduckgo.com/traffic).


Specifically:

* Between 2022-06-02 (https://web.archive.org/web/20220602142058/https://duckduckg...) and 2022-06-08 (https://web.archive.org/web/20220608024746/https://duckduckg...) they removed the graph. It was showing queries leveling off just under 100M/d.

* Between 2022-11-18 (https://web.archive.org/web/20221118045948/https://duckduckg...) and 2022-12-06 (https://web.archive.org/web/20221206080717/https://duckduckg...) they took the page down entirely.
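
For anyone who wants to check those windows: as far as I know, Wayback Machine snapshot URLs carry a 14-digit UTC capture timestamp (YYYYMMDDhhmmss) right after `/web/`. Here is a minimal sketch, assuming that layout, that extracts the timestamp and measures the gap between two snapshots. The helper name `snapshot_time` is my own, and the links are pasted exactly as truncated above since only the timestamp part is read.

```python
import re
from datetime import datetime, timezone

# Assumption: snapshot URLs look like
#   https://web.archive.org/web/<14 digits>/<original URL>
_SNAPSHOT_RE = re.compile(r"web\.archive\.org/web/(\d{14})")


def snapshot_time(url: str) -> datetime:
    """Return the capture time encoded in a Wayback Machine snapshot URL."""
    match = _SNAPSHOT_RE.search(url)
    if match is None:
        raise ValueError(f"no 14-digit Wayback timestamp in {url!r}")
    parsed = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


# Links copied as they appear in the list (truncated by HN).
graph_present = snapshot_time(
    "https://web.archive.org/web/20220602142058/https://duckduckg..."
)
graph_removed = snapshot_time(
    "https://web.archive.org/web/20220608024746/https://duckduckg..."
)

# Upper bound on how long the removal could have gone unnoticed by the archive.
print(graph_removed - graph_present)
```

This only bounds the change between two captures; it says nothing about the exact day the graph was removed.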


That's not a good sign.

> I was hoping they were busy building their own search engine the past several years, but it doesn't seem they have been.

I assumed this as well. If Google could do this in the late 90s and iterate into the 2000s, why can't DDG do it over a decade in the 2020s?

Maybe DDG has figured out some business model and doesn't want to mess with it or something.


> If Google could do this in the late 90s and iterate into the 2000s why can't DDG be able to do it over a decade in the 2020s

Well, part of it is there is a lot more content to index now, which means you need a lot more resources to get started.


> Being a thin wrapper on Bing will only get you so far... I was hoping they were busy building their own search engine the past several years, but it doesn't seem they have been.

Indeed, in the startup world it is common knowledge that you should buy whatever you can buy - never build. I'm guessing DDG took this approach also. At some point though, I've seen companies get to where they aren't actually offering anything of value. Brand value can be high, so some companies can float on brand alone, but it's quite a risk. Especially in DDG's case, relying on a third party (especially one that directly competes with you!) is a dangerous practice. By all means leverage the data you have access to, but if you're not improving significantly on it or aggregating or something, you're just rolling dice.


I've also tried to like it many times and used it as my default search engine, but the results are simply worse. I do cheer for them though but I'm not holding my breath. Being privacy focused only goes so far.


> Negative queries no longer filter all unrelated results.

This has been bothering me as well, but has been the case for as long as I know. For those who haven't seen this yet, try:

    "hacker" "news" -"hackers"
The results still include the word "hackers". It's like, the operator had one job, why offer this operator if it's going to ignore it? Note that the quotes are so that it definitely takes every word literally, but you get similar results if you unquote any/all of the words so it doesn't really matter.

I don't remember an example query, but in the cases where I needed it the most (it's relatively rare that I have no idea what other keywords might occur and I have to fall back to this negative selection in the first place) it seems to work the least well, often more than half of the results (starting with the top result) include the forbidden word(s).

There was also a case where more results included the forbidden word after adding it as an exclusion criterion. I should start writing these down...
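
If anyone else wants to write these down, here is a minimal sketch of how one might count the leaks, assuming you have already copied the result titles or snippets into a list by hand. The function name `leaked_results` and the sample titles are made up for illustration; they are not real search output.

```python
import re


def leaked_results(results: list[str], excluded: str) -> list[str]:
    """Return the results that still contain the excluded word as a whole word."""
    pattern = re.compile(rf"\b{re.escape(excluded)}\b", re.IGNORECASE)
    return [text for text in results if pattern.search(text)]


# Hypothetical titles standing in for what a search for
#   "hacker" "news" -"hackers"
# might return.
sample = [
    "Hacker News",
    "Why hackers read the news",
    "Hacker News guidelines",
]

leaks = leaked_results(sample, "hackers")
print(f"{len(leaks)} of {len(sample)} results ignore the exclusion")
```

Note that this only looks at the text you feed it. A page can match the forbidden word in its body without showing it in the title or snippet, so a count from titles alone would understate the problem.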


Working nicely on Brave Search. I don't use the operators much but I do notice the difference with and without `- "hackers"`


If I translate the main page from Polish, it has text that resembles what you'd expect from scams: "Quidco - collect your £10", "Wise - transfer £500 for free", etc. There's even a bit of preface text that seems to acknowledge some content sounds a bit like a scam: "I've never been too sure about sites that offer so-called Cashback"


Seems to be a reference to this blog post?

https://dariusz.wieckiewicz.org/quidco/

The big question is if Bing will reindex it later, assuming this is the problem content (who knows).

Sidenote: What is "cashback shopping"? I'm familiar with credit cards that do that but not websites. Is it scammy?


> I'm familiar with credit cards that do that but not websites. Is it scammy?

If you buy products from a store via a referral link, the referrer gets paid a portion of the price. Cashback sites are organised around this principle, with referral agreements with a large number of stores. They share some of the earnings with the buyer. They are commonly used in the UK.


This is why the position of Bing annoys me. Nobody checks the content; they just assume. Quidco is not a scam, and my article describes how to get what you deserve. Quidco is no different from the coupons or loyalty programs offered to customers.


It doesn't even matter what the content says anyway (and I personally wasn't going to read a full article I was just curious about cashback schemes), the search spam systems will always be automated and lack wider context.


I can understand your frustration. These days an LLM can arbitrarily blackhole you, effectively becoming judge, jury and executioner. You cannot even reverse the mistake; it’s all automated and Orwellian.


I don't think Orwellian is the right word here. It makes it sound like this is a system that has malicious intent against you. I don't think that is the case here...

Human systems were never designed to be as interconnected as they are now. I'm not talking about computer networks, I'm talking about our social systems and things like Dunbar's number. How do you codify one culture's social expectations in a system that works over the world wide web? How do you design a system that can be attacked from anywhere in the world at any time? If you're connected, you have one degree of separation to everyone else.

A system like this will experience such a huge amount of bullshit asymmetry that the cost of electricity and bandwidth to determine your truthiness quotient is a significant operating cost. The first moment any deviance is detected in your response you will be banned simply because the system cannot afford to do otherwise.


These are links to the articles where I describe Quidco and Wise, and additionally how you can get money back: in the UK you get £5 from GiffGaff, £10 from Quidco, and so on. None of the solutions described is a scam. I always recommend only things that I have tried and tested myself first. I use Quidco and Wise constantly, so if somebody classed my site as spam based just on these two promo links to the full articles, they simply did not do their job and did not read the text behind them.


A short story on how my website lost all indexed pages in Bing, affecting my presence in DuckDuckGo, secret blacklisting and classification as a splog by Bing AI


Sites are delisted from Bing & DDG as a result of negative SEO attacks. I've experienced this multiple times and responded to a couple of threads on HN about it.

The attacker doesn't care about your Bing listing, it's an unintended coincidence of an attempt to get your site deindexed from Google search that doesn't work.

Google ignore bad links to your site, Bing don't.

After two years I can confidently conclude that MS have done a great job of hiding this 'defect' and virtually nothing regarding the root cause.

In my experience, the sites come back on their own eventually but with a lot of really weird URLs being indexed first (a sure sign of the original attack).


Wow, TIL "negative SEO attacks."


Has happened to my site as well, I gave up after doing around the same thing.. https://www.bing.com/search?q=site%3Apdf.to


I think that is not the correct syntax. I just searched for redhat.com and suse.com

https://www.bing.com/search?q=site%3Aredhat.com

https://www.bing.com/search?q=site%3Asuse.com
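
For what it's worth, all three links in this subthread appear to be the same kind of query: the `site:` operator with `:` percent-encoded as `%3A`. Here is a minimal sketch that builds and decodes such URLs with the standard library. The helper name `bing_site_query` is my own, and the URL layout is simply taken from the links above, not from any Bing documentation.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def bing_site_query(domain: str) -> str:
    """Build a Bing search URL restricted to one domain via the site: operator."""
    return "https://www.bing.com/search?" + urlencode({"q": f"site:{domain}"})


for domain in ("redhat.com", "suse.com", "pdf.to"):
    print(bing_site_query(domain))

# Decoding the parent's link shows the same operator with a different domain.
parent = "https://www.bing.com/search?q=site%3Apdf.to"
print(parse_qs(urlsplit(parent).query)["q"][0])
```

So a difference in what the pages show would come from the index (or the domain), not from the query syntax.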


It must be doing some kind of personalization for you, because when I load those two links in an incognito window, the first page is made up entirely of links to the constrained domains. That differs from the parent's link which in an incognito window just displays some kind of bing-looking error message


Seems like a great extortion business

/s


Old black hat SEO. Was never in that industry too seriously, but I did keep looped in my tech news as a kid as it gave me another perspective on the growth of the internet


Very good article -- I am having the exact same issues (I run a lightweight writing site with original writing and no ads). One thing that caught my attention is the timing. Bing blacklisted my domain from web search results on January 14 after I had no issue with them for 2.5 years (interestingly, I wasn't fully blacklisted for image results but that doesn't help me much). The rest of your article, from lack of support and information to Bing Webmaster only identifying minor SEO issues has been my experience. My site has no indexing issues with Google, Yandex, Brave, or Mojeek... so whatever it is is Bing-specific. I agree with your point on DDG too. I also had more DDG traffic than Bing traffic, and losing a smaller number of Ecosia, Qwant, and Yahoo referrals is also unfortunate. Many people do not realize how many of the alternative search tools rely on Bing's index.

Whatever is causing it -- I hope it is fixed soon.


Thanks!


Try to contact Mikhail (Bing CTO, @MParakhin) or Michael Schechter (Bing Growth VP) on Twitter. They are very helpful. Or Bing Head of Product, Jordi, who is very helpful by email as well: jordir at Microsoft dot com.


That might have been more useful as a title (shortened, obviously, but longer than the two words that mean nothing together as it is now)


Bing crawler does in my limited experience seem pretty picky... It doesn't even seem to be able to follow HTTP->HTTPS 308 redirects on one of my sites (it has successfully indexed another) - it makes requests to http:// for my site, gets a 308 redirect to https://, and then does nothing with that, other than robots.txt requests.
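
To see exactly what a crawler gets back at each step, one option is to walk the redirect chain by hand instead of letting a client follow it. A minimal sketch using only the standard library follows; the names `fetch_head` and `trace_redirects` are my own, it assumes the server answers HEAD the same way it answers GET, and the demo uses a fake fetcher for a hypothetical host so that it runs offline.

```python
import http.client
from typing import Optional
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def fetch_head(url: str) -> tuple[int, Optional[str]]:
    """Send one HEAD request without following redirects; return (status, Location)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    try:
        conn.request("HEAD", target)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def trace_redirects(url: str, fetch=fetch_head, max_hops: int = 5) -> list[tuple[str, int]]:
    """Follow Location headers manually and record the status seen at each URL."""
    hops = []
    for _ in range(max_hops):
        status, location = fetch(url)
        hops.append((url, status))
        if status not in REDIRECT_CODES or not location:
            break
        url = urljoin(url, location)  # Location may be relative
    return hops


# Stand-in for a server that upgrades HTTP to HTTPS with a 308 (hypothetical host).
def fake_fetch(url: str) -> tuple[int, Optional[str]]:
    if url.startswith("http://"):
        return 308, "https://" + url[len("http://"):]
    return 200, None


print(trace_redirects("http://example.org/", fetch=fake_fetch))
```

Against a real site you would call `trace_redirects("http://your-site/")` with the default fetcher and compare the chain with what shows up in the access log for the crawler's requests.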


Because that's not the right way to do that. Just look at the site you are on right now:

    > curl -i http://news.ycombinator.com/
    HTTP/1.1 301 Moved Permanently
    Location: https://news.ycombinator.com/


308 is the right way to do it, 301 is the old and slightly buggy way. Specifically, 301 lets a POST get redirected as either a GET or a POST, “for historical reasons”, while 308 guarantees no changes. 308 is not exactly edgy, either, it’s been supported by every browser since 2015.
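
To make the difference concrete, here is a small sketch of which method a client typically uses for the follow-up request after each redirect status. It models common browser behaviour as I understand it, not any particular crawler, and the function name `method_after_redirect` is made up.

```python
def method_after_redirect(status: int, method: str) -> str:
    """Method a typical browser uses for the follow-up request after a redirect."""
    if status in (307, 308):
        return method  # method (and body) must be preserved
    if status == 303:
        return "HEAD" if method == "HEAD" else "GET"
    if status in (301, 302):
        # Allowed "for historical reasons": POST is commonly rewritten to GET.
        return "GET" if method == "POST" else method
    raise ValueError(f"{status} is not a redirect status handled here")


for status in (301, 308):
    for method in ("GET", "POST"):
        print(status, method, "->", method_after_redirect(status, method))
```

For a plain GET, which is presumably all a crawler sends, 301 and 308 come out the same, so in principle the choice should not matter to an indexer that understands both.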


As far as I'm aware (I have read up on it), 308 is a newer and just as acceptable way of doing it, and in fact it's what Caddy (the webserver I'm using) uses for its built-in HTTP->HTTPS redirect functionality...

https://www.rfc-editor.org/rfc/rfc7538

It does seem possible/likely that Bing doesn't understand this though and only understands 301...


I notice that it's not just Bing: on this very site I see that 18 of your 22 submissions (all within the past 13 months, all to your blog) are marked as '[dead]'.


Random guess: maybe because people see that all submissions are self promotion and they flag it?


'[flagged]' appears to be a strict subset of '[dead]'. Trawling thru the 'New' tab it looks like '[dead]' by itself appears to be a soft ban on the site ('soft' because any individual link is salvageable by votes)


But submissions here don't have anything to do with Bing indexing.


I’ve been battling a case like that for about a year. That’s disheartening of course; however, what’s worse was that Bing was featuring spammy proxied copies of my website on their first pages!

I’ve reported that and lost lots of time and hope that something can be done.


As an aside, MalwareBytes blocked this page. It may have ended up on some blocklist?


Could you share some more information? You can send me an email with a screenshot. My site does not serve any ads and does not use any shady tactics; it is a pure HTML and CSS website with just a minimum of JavaScript.


Your domain seems to be on the following blacklists:

- Some kind of anti-annoyance filter (for reCAPTCHA or something?): https://filters.adtidy.org/windows/filters/237.txt

- An anti-anti-adblock list: https://github.com/bogachenko/fuckfuckadblock/blob/master/fu...

- Something about some kind of Indian ISP (?) injecting code into domains: https://github.com/sudotman/indianadblock/

If one of these lists got picked up by an abuse filter somehow, that'd explain why search engines and malware filters would throw a fit about your site.
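
If you want to see what the lists actually say about a domain, a rough way is to download the list text and print every rule that mentions it. A minimal sketch follows, assuming Adblock-style syntax where `!` starts a comment. The function name `lines_mentioning` and the sample rules are made up and not copied from any of the lists above.

```python
def lines_mentioning(filter_text: str, domain: str) -> list[str]:
    """Return non-comment filter lines that mention the domain (naive substring match)."""
    hits = []
    needle = domain.lower()
    for line in filter_text.splitlines():
        rule = line.strip()
        if not rule or rule.startswith("!"):  # skip blanks and comments
            continue
        if needle in rule.lower():
            hits.append(rule)
    return hits


# Made-up excerpt in Adblock-style syntax.
sample_list = """\
! Title: example list
||ads.example.net^
example.org##.cookie-banner
"""

print(lines_mentioning(sample_list, "example.org"))
```

The kind of rule matters. A line like `example.org##.cookie-banner` is a cosmetic rule that hides one element on that site, which is quite different from a `||domain^` rule that blocks requests to the domain, so a hit on its own does not mean the site was listed as abusive.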


Thanks. That's total nonsense. Why would somebody add my domain to these lists when I haven't displayed any ads for a year!


I've had this exact same issue for the last year, Bing support have been useless. I'll do the same and write a blog post


Share the link if you do. Would like to read through your thoughts.



As noted in the thread about DDG removing the ability to filter out search terms [0], Microsoft announced in February that the search API is getting much more expensive on May 1 2023. This is likely shifting features to meet what they can afford.

[0] https://news.ycombinator.com/item?id=35683254


Yeah this happens all the time and sites will just randomly get indexed again. Happened to my large site after a few months. Definitely some massive error they haven't bothered looking into

The same thing happens with Pinterest, too. They will just randomly ban sites and hide pins that link to them. Then they come back later


But DDG is an independent search engine, with its own crawler and multiple sources of search results! /s


Their main results are based on Bing.


A few weeks since Bing came back from the oblivion as a possible search contender and already we're getting the "big company can screw you over and there's nothing you can do about it".

This was/is a problem with Google (search, youtube, gcp etc.) where people's income relies on these services and the company just doesn't care. I've seen a bunch of posts on HN about problems like this where the comments are full of other people sharing similar stories.

It's depressing, but the takeaway is to be smart about keeping content separate between domains and being prepared for the hammer to fall at any moment.


>This was/is a problem with Google (search, youtube, gcp etc.) where people's income relies on these services and the company just doesn't care.

I'm curious as to whether or not you think Google should care.

I've seen many, many cases (as I'm sure, have you) of Google (as well as all the corps that got fat on the backs of their products'^W users' PII, browsing histories, social connections and personal health information -- what do you think Google is doing with all that non-HIPAA protected data from FitBit? --, not just Google) making sure their customers are well taken care of, at the expense of their users.

They never seemed to care before, why should they start now? In fact, if they did start to care about their users, management would likely be ousted tout de suite, as it would negatively impact the stock price and profits.

Google is a profit-making organization and, as such, they rightly (from the perspective of investors/Wall Street) optimize for profit, regardless of the impact to users/impact on their users' profits.

Does that stifle commerce and innovation? Absolutely. But Google (and pretty much all for-profit organizations) only care about the health of their own businesses, not anyone else's.

In an effort at 'reductio ad absurdum'[0], I posted this[1] in a different discussion here:

   If flooding every gmail user's inbox with Goatse[1] every six minutes would 
   increase revenue (and/or stock price), a billion people would be intimately 
   familiar with a stranger's rectum. And GOOG would laugh all the way to the 
   bank.
The long and short of it (at least IMHO) is that you shouldn't expect Google to act against its interests, and those are, more often than not, not that of their users. Ergo, folks (especially those trying to make a buck) should vote with their feet and shun Google, not just shrug and try to mitigate the risks of running afoul of Google's interests.

[0] https://en.wikipedia.org/wiki/Reductio_ad_absurdum [2]

[1] https://news.ycombinator.com/item?id=35555781

[2] Upon further reflection, the actions I mention aren't all that absurd, rather the object in the example I gave of those actions are so.


Does MS give/sell crawl data to OpenAI?


Officially OpenAI uses CommonCrawl, or at least used to do so.


OpenAI used Common Crawl for GPT-3. For GPT-4 they don't say anything.


Cool, thank you for that tidbit of information!


It appears that this business model could potentially involve unethical practices that rely on extortion tactics.



