Ask HN: Someone is proxy-mirroring my website, can I do anything?
466 points by stanislavb on Dec 12, 2022 | 291 comments
Hi Hacker News community,

I'm trying to deal with a very interesting (to me) case. Someone is proxy-mirroring all content of my website under a different domain name.

- Original: https://www.saashub.com

- Abuser/Proxy-mirror: https://sukuns.us.to

My ideas of resolution:

1) Block them by IP - That doesn't work as they are rotating the IP from which the request is coming.

2) Block them by User Agent - They are duplicating the user agent of the person making the request to sukuns.us.to, so there is nothing distinctive to block.

3) Add some JavaScript to redirect to the original domain-name - They are stripping all JS.

4) Use absolute URLs everywhere - they are rewriting every occurrence of www.saashub.com to their domain name.

i.e. I'm out of ideas. Any suggestions would be highly appreciated.

p.s. what is more, Bing is indexing all of SaaSHub's content under sukuns.us.to ¯\_(ツ)_/¯. I've reported a copyright infringement, but I have a feeling that it could take ages to get resolved.




Same thing happened to me and my service (https://next-episode.net) almost 2 years ago.

I wrote an HN post about it as well: https://news.ycombinator.com/item?id=26105890, but to spare you all the irrelevant details and digging through the comments for updates, here is what worked for me. You can block all their IPs, even though they may have a LOT and can change them on each call:

1) I prepared a fake URL that no legitimate user will ever visit (like website_proxying_mine.com/search?search=proxy_mirroring_hacker_tag)

2) I loaded that URL like 30 thousand times

3) from my logs, I extracted all IPs that searched for "proxy_mirroring_hacker_tag" (which, from memory, was something like 4 or 5k unique IPs)

4) I blocked all of them

After doing the above, the offending domains were showing errors for 2-3 days and then they switched to something else and left me alone.
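If it helps, here's a minimal sketch of step 3 - pulling the offending IPs out of an access log - assuming a combined-format log at a made-up path and the honeypot tag from step 1:

    import re
    from collections import Counter

    LOG_PATH = "/var/log/nginx/access.log"   # hypothetical path
    TAG = "proxy_mirroring_hacker_tag"       # honeypot marker from step 1

    # Client IP sits at the start of each combined-log line.
    ip_re = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3}) ")

    hits = Counter()
    with open(LOG_PATH) as log:
        for line in log:
            if TAG in line:                  # only requests for the honeypot URL
                m = ip_re.match(line)
                if m:
                    hits[m.group(1)] += 1

    # Unique offending IPs, most active first, ready to paste into a block list.
    for ip, count in hits.most_common():
        print(f"{ip}  # {count} hits")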

I still go back and check them every few months or so ...

P.S. My advice is to remove their URL from your post here, so that search engines don't pick up their domain and rank it with your content ...


Might I suggest a spin on this: instead of blocking the IPs, consider serving up different content to those IPs.

You could make a page that shames their domain name for stealing content. You could make a redirect page that redirects people to your website. Or you could make a page with absolutely disgusting content. I think it would discourage them from playing the cat and mouse game with you and fixing it by getting new IPs.


One possibility: Serve different content, but only if the user agent is a search engine scraper. Wait a bit to poison their search rankings, then block them.


... be careful with this.

Assuming you've monetized your content with ads, depending on your ads provider, this may have deleterious effects on your account with that provider, as they may then assume you're trying to game ads revenue.


The mirror is almost certainly running their own ads, given they strip the JavaScript out.


I've tried this with zip bombs, but I can't tell how well it worked out.


Wait, what? Care to follow up on this hypothetical, please?


Zip bombs are files that, when unzipped, expand to enormous sizes. I'm not sure if OP put one up to be downloaded so the offender would kill their disk space, or if you could stream one hoping the client browser/scraper would attempt to decompress it and crash from memory or disk exhaustion.

That's my read on it anyway.


Did the same things for spam bots :p


> Or you could make a page with absolutely disgusting content.

Not if you value the people who might move to the real domain.


You could do this without affecting normal traffic, depending on the uniqueness of the IPs doing the scraping.

Love the idea.


I think you missed the point - if people show up at $PROXY expecting nice stuff but see junk, then they won't move over to $REAL and will instead blame $REAL.

E.g. you'd like some way to redirect people from $PROXY site to $REAL site, and disgusting content on $PROXY won't do that - it'll reflect poorly on $REAL


If you can identify the crawler - you can provide 'dynamic' content for that specific user context.


It's a proxy, so there's no "crawler". It's just an agent relaying to the user. Passing something to this proxy agent just passes it directly to the user.


If those IPs are VPN services, you might be negatively affecting all VPN users in addition to the proxy.


"Or you could make a page with absolutely disgusting content." You've never heard of Rule 34, have you...


obviously somebody too young to have seen the method of using an http redirect to the goatse hello.jpg for unwanted requests

edit: or when somebody embed-links your image inside some forum, replace the original filename with the contents of hello.jpg


As soon as you have a few of their IPs, look them up on ipinfo.io/1.2.3.4 and you'll find they probably belong to a handful of hosting firms. You can get each firm's entire IP list on that page and add all of those CIDRs to your block list. Saves you needing to make 30K web requests.

In most countries in the western world, there are 3-4 major ISPs and this is where 99% of your legit traffic comes from. Regular people don't browse the web proxying via hosting centres as Cloudflare will treat them with suspicion on all the websites they protect.
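A rough sketch of that grouping step, assuming the public ipinfo.io JSON endpoint (it returns an "org" field with the ASN and provider name; a token may be needed at volume):

    import json
    import urllib.request
    from collections import defaultdict

    def ip_org(ip):
        # Return the "AS#### Provider Name" string for an IP via ipinfo.io.
        with urllib.request.urlopen("https://ipinfo.io/%s/json" % ip, timeout=10) as resp:
            return json.load(resp).get("org", "unknown")

    offenders = ["198.51.100.7", "203.0.113.42"]   # example IPs from the honeypot logs

    by_org = defaultdict(list)
    for ip in offenders:
        by_org[ip_org(ip)].append(ip)

    # A handful of hosting firms usually covers most of the list; block their
    # whole CIDR ranges instead of chasing individual IPs.
    for org, ips in by_org.items():
        print(org, ips)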


The site seems to be hosted on OVH cloud. OP should report this to them.

https://www.ovh.com/abuse/

Found the hosting information from here: https://host.io/us.to


Consider reaching out to Afraid.org first, https://freedns.afraid.org/contact/

They are the ones providing the subdomain


THIS ^^


For 2) you mean you loaded it from the adversary's proxy site, just to clarify?


Yes, I constructed the honeypot URL using the proxy site and called it (thousands of times) so I could get them to fetch it from my server through their IPs, which I could then log.


They literally proxy your website? I thought they'd cache it... that makes more sense now given your statement that you hit their website with a specially formatted URL. Since they pass that through to you, you can filter on that.

Also: since you say 4k-5k IPs... any of them from cloud providers? And specific location?


No cloud providers as far as I'm aware.

They were all from the same 4-5 ASN networks, all based in Russia.


If you happen to use Cloudflare.... Cloudflare -> Firewall rules -> Russia JS Challenge (or block)


Residential proxy botnet.


Why do they bother doing this domain proxy stuff in the first place?


High quality content with a good standing in Google => unique and quality impressions => more revenue from the ads they insert in the content.


There is also the potential to use it as a watering hole for more sophisticated or subversive measures where they subtly change what you post to promote something you don't actually promote (so at some point they deviate from pure proxy to mitm).


Also for (2), any worries that your own providers might imagine you're trying to mount some half-baked DOS campaign?


Wasn't really worried about that.

I didn't do it as a super quick burst, but over a space of multiple hours.

First, because the proxy servers were super slow, and second, I couldn't automate it - their servers had some kind of bot detection which would catch me calling the URLs through a script.

Instead, I installed a browser extension which would automatically reload a browser tab after a specified timeout (I set it to 10 seconds or something), opened like 50 tabs of the honeypot URL, and left them there to reload for hours ...


Watch out, as this is not optimal - they can fingerprint your browser. But it looks like they were people with low IQ, so you were fine.


Side note: great idea for a website. This could be really helpful. You got a new user here.


I have to agree, my SO has been looking for something like this for a long time. Signing up today!


Wow, hadn't seen this before. Awesome site!


Thanks!


>4) I blocked all of them

Don't block them. Show dicks instead


Once you have their IP addresses you can make them serve anything you want. Set your imagination free.

For starters: copyright-infringing material.


Unless you hold the necessary rights to the copyrighted material, that would make you a copyright infringer yourself.


How would they prove that, when they label all content as if it were their own?


Presumably they'll tell the copyright holder that sues them where they got it from, provide evidence for that, and then the copyright holder will (also) sue the original source.


Makes me wonder if you could switch the content you serve based on the URL, so it redirects back to your website, or displays images marked as copyrighted.


I tried but couldn't redirect back to my website as they stripped / rewrote all JS.


You could have a "stolen content" pure HTML/CSS banner that gets removed by Javascript. Only proxy site visitors will see the banner because the proxy deleted the Javascript.


Some people, like me, will see the "stolen content" banner on the original website. And attackers can trivially remove it as soon as they become aware of it.


Would it be possible to hide a hash/encoded URL somewhere in JS and delete the site/redirect if the hash/encoded URL contained something unexpected?


Thanks for the advice. I will give a go to some of these. p.s. I can't remove the URL as the post is not editable anymore. I'm just waking up... in Australia.


The mod can though, if you email him at hn@ycombinator.com.


8chan, like every forum ever, has dumb moderators who don't know how to do their job / overextend their hand (and the moderation position of web forums seems to attract people with certain mental disorders that make them seek out perceived microinjustices, the definition of which changes from day to day)

there were a bunch of sites mirroring 8chan to steal content

these were useful because they had both a simpler / lighter / better user interface (aside from images being missing), and posts / threads that were deleted would stay on the mirrors. being able to see deleted posts / threads was highly useful as the moderation on such sites tends to be utterly useless and the output of a random number generator. it was hilarious reading "zigforum" instead of "8chan" in all the posts as the mirror replaced certain words to thinly veil their operation. they even had a reply button that didn't seem to work or was just fake.

tl;dr the web is broken and only is good when "abused" by proxy/mirrors


Instead of blocking by IP, just check the SERVER_NAME/HTTP_HOST variables in your backend/web server (or, in the page's JavaScript, check window.location.hostname) and, if those contain anything but the original hostname, redirect to the original website (or serve different content with a warning to the visitor). If you have apache2/nginx this can be easily achieved by creating a default virtualhost (which is not your website), and additionally creating your website's virtualhost explicitly. Then the default virtualhost can serve a proper redirect for any other hostname.

Those variables are populated from the browser's request; unless the proxying server is rewriting them, your web server will be able to detect the imposter and serve a redirect. If rewrites are indeed in place, then check on the frontend. Blocking by IP is the last option if nothing else works.
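A minimal backend-side sketch of that check, here as Flask middleware purely for illustration (the framework and domain are assumptions); note it only helps if the proxy forwards the visitor's original Host header instead of rewriting it:

    from flask import Flask, redirect, request

    app = Flask(__name__)
    CANONICAL_HOST = "www.saashub.com"

    @app.before_request
    def enforce_canonical_host():
        # request.host comes from the Host header the client (or proxy) sent.
        if request.host.lower() != CANONICAL_HOST:
            return redirect("https://%s%s" % (CANONICAL_HOST, request.full_path), code=301)

    @app.route("/")
    def index():
        return "Welcome to the real site."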


As the OP mentioned, JS is stripped and URLs are being rewritten, so I doubt either of those approaches will work.


Making js essential is not that hard, right? Just "display: none" on the root element, which is removed by js :)

More sophisticated options can been found in other comments.


Forcing all users of your website to use JavaScript to get around a scammer is pretty heavy-handed.


Presumably OP would only have to do this for a limited time, until the scammer gives up and moves on to an easier target. It's not the best, but I don't think it's as bad as you say.


Just explain why in a way that vanishes with JS enabled. Like others have said, it won't need to be used for long.


most websites these days already use javascript, and all modern browsers already support it. unless you're some really niche turbonerd website, nobody is going to notice or care...


Show me one website that today really works without javascript.



I've been surfing without javascript since 2015. Most websites continue to work fine without it (though some aesthetic breakage is pretty standard). About 25% of sites become unusable, usually due to some poorly implemented cookie consent popup. I don't feel like I'm missing out on anything by simply refusing to patronize these sites. I will selectively turn JS on in some specific cases where dynamic content is required to deliver the value prop.


Same, I even wrote a chrome extension to enable js on the current domain using a keyboard shortcut; but it has gotten to be more of a pain especially on landing pages.


> Most websites continue to work fine without it

> About 25% of sites become unusable

These two statements seem pretty contradictory. 75% feels like a low threshold for "most."


Most is more than half.


In casual conversation, I would never interpret most as being solely more than half. However, it seems like perhaps most people agree with you :)

https://english.stackexchange.com/questions/55920/is-most-eq...


In my entirely casual understanding of English most means the set that has more members than any other. When the comparison is binary (sites that work vs sites that don't) then "more than half" is both necessary and sufficient as a definition.

When comparing more than two options most could be significantly less than half (e.g. if I have two red balls, and one ball each of blue, purple, green, orange, pink, and yellow, then the color I have the most of is red, despite representing only one quarter of the total balls.)

That said, any attribute attaining more than half of the pie must be most.


In retrospect, the never in my previous comment was certainly an overstatement. While I agree with your reasoning, there is often a distinction between technically correct use of language, and what the hearer is likely to understand from what is said.


Even JS-heavy websites are moving towards being usable without Javascript with server side rendering.


The other kind of problem is if the website is not really proxied but rather dumped, patched, and re-served. In that case the only option (if a JavaScript frontend redirect doesn't work) is blocking the dumping server by IP.

To identify IPs, as pointed out in the root comment of this thread, you can create a one-pixel link to a dummy page, which dumping software would visit but a human wouldn't. Then you can see who visited that specific page and block those IPs for good.


I would think you'd want to be careful about search engines with that approach. Assuming the OP wants their site indexed, you could end up unintentionally blocking crawlers.


Tail wagging the dog is never a good answer.


It's trivial to strip that "display: none" out, too.


Yea if they're already rewriting content to serve ads (likely since they're probably not doing this for altruistic reasons) you're just putting off the inevitable. While blocking or captcha'ing source IPs is also a cat and mouse game it's much more effective for a longer period of time.


Maybe an html <meta> redirect tag that bounces through a tertiary domain before redirecting to your real one? If they noticed you were doing it they could mitigate it, but they might deem it too much effort and just go away.

You might also start with the hypothesis that they're using regex for JS removal and try various script injection tricks...


If they're already stripping JS, I can't imagine it would be a lot of work to also remove the <meta> redirect.


1. Create a fake URL endpoint, and go to that endpoint on the adversary's website; when your server gets the request, flag the IP. Do this nonstop with a script (see the sketch after this list).

2. Create fake HTML elements and put unique strings inside them. Then you can search for those strings in search engines to find similar fake sites on different domains.

3. Create a fake HTML element and put all the request details in it in encrypted form. Visit the adversary's website, look for that element, and flag that IP OR flag the headers.

4. Buy proxy databases, and when any user requests your webpage, check if it's a proxy.

5. Instead of banning them, return fake content (fake titles, fake images, etc.) if a proxy is detected OR the IP is flagged.

6. Don't ban the flagged IPs. He/she is gonna find another one. Make them and their users angry so they give up on you.

7. Maybe write some bad words to the user in random places in the HTML when you detect flagged IPs :D The users will leave the site and that will reduce the adversary's SEO score. They will be downranked.

8. Enable image hotlinking protection. Increase the cost of proxying for them.

9. Use @document CSS to hide the stuff when the URL is different.

10. Send abuse mail request to the hosting site.

11. Send abuse mail request to the domain provider.

12. Look up the flagged IPs and try to find the proxy provider. If you find it, send an abuse mail to them too.
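A rough sketch of ideas 1 and 5 combined, as a small Flask app (the endpoint name is a made-up example, and a real setup would persist the flag list instead of keeping it in memory):

    from flask import Flask, request

    app = Flask(__name__)
    flagged_ips = set()              # in-memory for the sketch; persist it for real

    HONEYPOT_PATH = "/search-qx9z"   # hypothetical URL no legitimate user would hit

    @app.before_request
    def flag_and_poison():
        ip = request.remote_addr
        if request.path == HONEYPOT_PATH:
            # Idea 1: you request this URL through *their* site, so the IP seen here is theirs.
            flagged_ips.add(ip)
        if ip in flagged_ips:
            # Idea 5: don't block - serve junk so their mirror fills up with garbage.
            return "<html><body><h1>Totally Unrelated Fake Title</h1></body></html>"

    @app.route("/")
    def index():
        return "Real content for real visitors."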

Edit: More ideas sparked in my mind while I was on the toilet:

1. Create fake big css files (10MB etc). And repeatedly download that from the adversary's website. This should cost them too much money on proxies.

2. When you detect proxy, return too big fake HTML files (10GB) etc. That could crash their server if they load the HTML into the memory when parsing.


I like how you think. These are all great ideas!

Reminds me of a time some real estate website hotlinked a ton of images from my website. After I asked them to stop and they ignored me I added an nginx rewrite rule to send them a bunch of pictures of houses that were on fire.

For some reason they stopped using my website as their image host after that.


What was the primary motivator to do this?

I'm curious if they are stealing anything else, e.g. are they selling ads/tracking, do they replace order forms with their own...


because I asked them to stop doing it, and they didn't. Technically they were stealing my bandwidth.

Also to teach them an important lesson about the internet.


haha, they're just lucky you didn't introduce them to Goatse


well actually...

there was another time a site hotlinked to a js file. After asking them to stop, i found that they had a contact form with a homebrew captcha which created the letters image like http://evilsite.com/cgi-bin/captcha.jpg?q=ansr

A little while later, their captcha form had a hidden input appended with the correct answer value, and the word to solve was changed to a new 4 letter word from a dictionary of interesting 4 letter words. The form still worked because of the hidden input. I might have changed the name on the "real" input also.


Signal boosting suggestion #1 here. Great idea.

Additionally if they decide to blackhole the fake/honeypot url, since you mentioned they pass along the user agent, you could mixin some token in a randomized user agent string that your scraper uses so that you could duck-type the request on your end to signal when to capture the egress ip.


#5 and #6 are key. Don't try to block them directly, just get them delisted. When you've worked out a way to identify which requests belong to the scammer, feed them content that the search engines and their ad partners will penalize them for.


Bummed that I can upvote this only once. Excellent work.


LOL! Thank you for the laugh. This is great.


What a sure-fire way to toast them! Kudos!


In my search for this I found that @document isn't well supported [0]. I'd suggest something like:

    a[href*= "sukuns.us.to"] {
     display:none; 
    }
Then use SRI to enforce that CSS.

[0]: https://caniuse.com/mdn-css_at-rules_document


How about something like...

    body[href*= "<OFFENDING URL>"] {
        background-image: url("http://goatse..."); 
    }
Ala: http://ascii.textfiles.com/archives/1011


Or just make the whole page rotate

    body[href*= "<OFFENDING URL>"] {
      animation: rotation 20s infinite linear;
    }

    @keyframes rotation {
      from {
        transform: rotate(0deg);
      }
      to {
        transform: rotate(359deg);
      }
    }


We're trying to punish the people running the proxy mirror, not the users who stumble upon them just trying to use the site


You could look at it as trying to get them blocked by search engines. Can you detect when they're proxying a search bot as opposed to a user? As for punish, you don't have to make it eye-bleach, just enough to make it firmly NSFW so nobody can get any business value from it, or even use it safely at work.

A little soft NSFW would also greatly accelerate them being added to a block list, especially if you were to submit their site to the blocklists as soon as you started including it. You can include literally anything that won't get you arrested. Terrorist manifestos, the anarchists cookbook, insane hentai porn... Use all those block categories - gore/extreme, terrorist, adult, etc.


In that case, write some JS, that wanders around the Hubble site, randomly downloading full-res TIFF images for the background, or that randomly displays Disney images.


Seems like it would be fairly easy to use this attribute selector and apply it to every element on the page, making them show up as empty to the user.


You could add a data attribute to the html tag of the document with the current URL, e.g.

  <html data-path="https://www.saashub.com/about">
then hide the full page with:

  html {display: none;}
  html[data-path*="saashub.com"] {display:block;}


This seems quite elegant and easy. Obviously in addition to other measures, but I like it.


Honestly, this is my favorite HN post in a while. I've had a lot of fun thinking over this challenge.


I'm with you, too!


I know this is just a game that never ends, but if they're already rewriting the HTTP requests what's stopping them from rewriting the page contents in the response?

SRI is for the situation where a CDN has been poisoned, not this.


It might not explicitly be what SRI is meant for but it'll narrow the proxy's options to:

A. Blank page

B. Let the find and replace update the CSS. Generate new hashes in the HTML.

C. Find someone new to pick on.

B is time and potentially computationally expensive, so it makes C a better option.


A doesn't work because B doesn't prevent the attacker from regexing out the hash altogether and changing the domain name in the tags to their own.


If they're rewriting html, I guess sanitizing css won't be beyond them.


Shadow nefarious techniques are the best. Don't give them clear indications that there is a problem.

For example, I had an app developer start stealing API content, so once I determined data points to key off of, instead of blocking them I simply randomized the API content details returned to their users' apps.

Hey, API calls look good, the app looks like it is working, no problem right? Well, the users of the app were pissed and the negative reviews rolled in. It was glorious.


Serious question — is there a way to defend from this "stealing the API" thing? E.g. building an authentication of some sort and then including a key with your app?


Of course HN doesn’t like anything that’s reminiscent of DRM, but Apple’s App Attest and Google’s Play integrity API can help dispense online services to valid clients only.


These are the best ideas, especially SEO poisoning and alternate images. If their point is to steal content and rankings then poisoning the well should discourage this in the future. I suspect their actual goal is to have a low-effort high SEO site to abuse as a watering hole for phishing attacks.

As a side note, their domain is linked in this thread so they are seeing HN in their access logs and probably reading this. It should make for an interesting arms race. Or red/blue team event.


They said the attacker was passing through the client's user agent. If they get a user agent that is GoogleBot, they could check if the requesting IP is actually a valid Google data centre (there is a published list of IPs). If the IP is not Google directly, they could return a blank page therefore causing Google to index nothing through the mirrored site.
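A sketch of that verification using Google's documented reverse-then-forward DNS check (more robust than maintaining a static IP list); stdlib DNS calls are assumed to be acceptable at your request volume:

    import socket

    def is_real_googlebot(ip):
        # Reverse DNS must end in googlebot.com or google.com, and the forward
        # lookup of that hostname must point back at the same IP.
        try:
            host = socket.gethostbyaddr(ip)[0]
        except socket.herror:
            return False
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        try:
            return socket.gethostbyname(host) == ip
        except socket.gaierror:
            return False

    # Example: a Googlebot user agent arriving from a non-Google IP gets an empty page.
    # if "Googlebot" in user_agent and not is_real_googlebot(client_ip): serve_blank_page()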


This is a good idea, though it may be short lived since the attackers are likely reading this due to the referrers in the logs. They may add an ACL to counter this but it might be interesting to see how long that works.


Seems like a good use case for a zip bomb. Return some tiny gzipped content that expands to 1gb.


Yeah. Their proxy is parsing the HTML and stripping it / modifying it, so they're obviously unzipping the responses on their servers. Create the honeypot endpoint, and if you get a request from that endpoint, reply with a zip bomb.

Then, write a little script that repeatedly hits that honeypot URL. I quite like this idea.
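For what it's worth, a sketch of that reply, assuming you can answer the honeypot request with a raw gzip body (a gigabyte of zeros compresses down to roughly a megabyte); whether their parser actually inflates it is the gamble:

    import zlib

    def gzip_bomb(uncompressed_size=1024 ** 3, chunk=1024 ** 2):
        # Build a small gzip body that inflates to `uncompressed_size` bytes of zeros.
        co = zlib.compressobj(9, zlib.DEFLATED, 16 + zlib.MAX_WBITS)  # 16+ => gzip wrapper
        zeros = b"\0" * chunk
        out = bytearray()
        for _ in range(uncompressed_size // chunk):
            out += co.compress(zeros)
        out += co.flush()
        return bytes(out)

    body = gzip_bomb()
    # Serve `body` with headers: Content-Encoding: gzip, Content-Type: text/html
    print("compressed size: %.0f KiB" % (len(body) / 1024))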


Awesome, do post a follow-up on HN, I want to hear how this war with the proxy asshats plays out.


> 5. Instead of banning them, return fake content (fake titles and fake images etc) if proxy is detected OR the ip is flagged.

> 6. Don't ban the flagged ip's. She/He's gonna find another one. Make them angry and their user's angry so they give up on you.

There's a popular blog that no longer gets linked on HN.

The author didn't like the discussions HN had around his writing, so any visitors with HN as the referer are shown goatse, a notorious upsetting image, instead of the blog content.


Goatse? I assume you're referring to jwz - that blog shows a testicle in an egg cup if it sees a HN referrer.


Yeah, jwz. Looks like I got mixed up - goatse has been a popular choice for this kind of thing, but jwz went with a different image.

Fortunately, there are many upsetting images for the OP to choose from!


Out of curiosity, which blog are you talking about?



Does anyone not have their referer header suppressed or faked?


I strip the referrer generally via https://wiki.mozilla.org/Security/Referrer; unfortunately it breaks a small number of sites very badly, such as web.archive.org and a few others, some of them claiming it was done to combat scraping.


Breaking is only part of the problem. The pages that rely on the referer header take it for granted and do not implement any meaningful error handling. They just die a horrible death, instead of responding with an error message stating that they need a referer.

One bad example is relying on the referer only for log-out, everything else works. That site also runs massive js on log-out, as if it really needs to rely on explicit log-out, and not just the user disappearing.


I have never considered faking or suppressing my referer header. I don't know why I would care. I suspect I'm in the company of well over 99% of all internet users.


Why return big files when you can return small files at excruciatingly slow speeds? modems are hot again!


that's probably the best advice. Instead of denying the proxy, just make it shitty to use for the end-user.


> Maybe write some bad words to the user on random places in the HTML

> Create fake big css files (10MB etc). And repeatedly download that from the adversary's website. This should cost them too much money on proxies.

Be careful when doing things like this, including the shock image option mentioned in other comments, as then it could become an arsehole race with them trying to DoS your site in retribution. Then again, going through more official channels could also get the same reaction, so…

> When you detect proxy, return too big fake HTML files (10GB) etc. That could crash their server if they load the HTML into the memory when parsing.

Make sure you are set up to always compress outgoing content, so that you can send GBs of mostly single-token content with MBs of bandwidth.


> Create fake big css files (10MB etc). And repeatedly download that from the adversary's website. This should cost them too much money on proxies.

Doesn't that also cost you an equal amount? You'll be serving them an equal amount that they proxy to the end user.

It's not even necessarily a cost for them; you're assuming that the host is owned and paid for by the abuser. If it's simply been hijacked (quite possible), you're just racking up costs for another victim.


I remember years ago there was a way to DDoS a server by opening a connection and sending data REALLY slowly, like 1 byte a second. I wonder if there is a way to do the opposite of that, where every request is handed off to a worker which responds just slowly enough to keep the connection alive. I doubt this can scale well, but just a thought.



The “opposite” thing you’re describing sounds like a tarpit: https://en.m.wikipedia.org/wiki/Tarpit_(networking)


you can have some fun with nginx if you can identify on your backend whether the request is coming from a malicious source, e.g. with X-Accel-Limit-Rate


I read once a suggestion to serve gzipped responses which, gzipped, are tiny, but un-gzipped are enormous. Like GBs of 0s.

Not sure how you actually do it and if it serves your purpose but sounded neat.


It's called a "zip bomb" (popularized by Silicon Valley [1]), and there is a good guide (and pre-generated 42kB .zip file to blow up most web clients) at https://www.bamsoftware.com/hacks/zipbomb/

[1] https://www.youtube.com/watch?v=jnDk8BcqoR0


Any recommendations on proxy database providers?


http://iplists.firehol.org/ looks free and very comprehensive. It has whole bunch of sub-lists of IPs that are likely to be sources of abuse, including datacenters and VPNs, and it gets updated frequently. Github: https://github.com/firehol/firehol


> 1. Create fake big css files (10MB etc). And repeatedly download that from the adversary's website. This should cost them too much money on proxies.

Nope - anybody doing this with at least a minimum of intelligence is using residential botnets as proxies.


Going defcon3 on proxies

You can also write some obfuscated inline JavaScript that checks the current hostname and compares to the expected one and redirects when not aligned.


They are stripping all JS.


Passive Aggressive FTW. These are all fantastic ideas.


I really like #9, this seems like a simple way to make your site unusable except via the methods you desire.


Oh, I love these. I will use some of them. Many thanks!


Fake 10GB html can be a zip bomb?


point no.1 will do. that's the solution.


Add a link rel="canonical" to your pages as well, it should give engines a hint that your domain is the legit one.

https://webmasters.stackexchange.com/questions/56326/canonic...

I noticed that the other domain is hotlinking your images. So you can disable image hotlinking, by only allowing certain domains as the referers. If you block hotlinked images then the other domain will not look as good. Remember to do it for SVGs too.

https://ubiq.co/tech-blog/prevent-image-hotlinking-nginx/

Finally I also see they are using a CDN called Statically to host some assets off your domain. You can block their scrapers by user agent listed here:

http://statically.io/docs/whitelisting-statically/


I think they are replacing all mentions of saashub.com with their domain. Also, I'm not using statically.io, that's something they are prepending in front of all images. Automatically.


But Statically isn't forwarding the User-Agent of the visitor, and they publish the list of User-Agents that they use, which you can block.


Sometimes the replacement is done with simple pattern matching. Try different forms of encoding your domain to see if you can get through their replacement.


It's adding the CDN for some of the images but not all of them, so you'd have to cover both


Setup Cloudflare on the domain and turn on “bot fight mode”.

If the TLS ciphers the client proposes for negotiation don't align with the client's User-Agent, they get a CAPTCHA.

I would suspect that whoever is doing this proxy-mirroring isn’t smart enough to ensure the TLS ciphers align with the User-Agent they’re passing through.


I would agree with the above, as an easier version of TLS fingerprinting. One could also use nginx/haproxy to extract enough TLS info and detect requests coming through the proxy. Magic string: JA3 fingerprint.


This is the correct first step.


On the free tier, does bot fight mode do anything other than simply detect bots based on JavaScript detections?


What about a slightly alternative approach, where instead of trying to block the abuser, you try to make it clear to end users what the real website is? E.g. in your logo image, include the real domain name "saashub.com". Have some introduction text on your home page "Here at saashub.com, we compare SaaS products ...." When your images are hotlinked, replace them with text like "This is a fraudulent website, find us at saashub.com". Anything that can make it obvious to end users that they're on the wrong website when they visit the abuser's URL.
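A minimal sketch of the image-watermarking part, using Pillow (an assumption) at build time or on the fly:

    from PIL import Image, ImageDraw

    def watermark(path_in, path_out, text="saashub.com"):
        # Stamp the real domain into the image so hotlinked copies advertise the source.
        img = Image.open(path_in).convert("RGBA")
        draw = ImageDraw.Draw(img)
        # Bottom-left corner, default font; a real version would scale the font to the image.
        draw.text((10, img.height - 20), text, fill=(255, 255, 255, 200))
        img.convert("RGB").save(path_out)

    # watermark("logo.png", "logo_marked.png")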

By the way, I've also reported the abuser as a phishing/fraud website through https://safebrowsing.google.com/safebrowsing/report_phish/?u...


Not sure if this would help since:

> 4) Use absolute URLs everywhere - they are rewriting everything www.saashub.com to their domain name.


Embed the welcome text in an image then!


Try some things like sa(zero-width-space)ssh<b></b>ub.com


One strategy tip: don't play cat and mouse. As you've demonstrated, if you change one thing, they will figure it out and change one thing. Not only does that not work, but you are training them that it's worth trying to beat your latest change.

Instead, plot a few different changes and throw them in all at once. Preferably in a way where they will have to solve all of the changes at the same time to figure out what happened and get things working again. Also, favor changes that are harder to detect. E.g., pure IP blocks are easier to detect than tarpitting and returning fake/corrupted content. The longer their feedback loops, the more likely it is that they'll just give up and go be a parasite somewhere else.


> pure IP blocks are easier to detect than tarpitting and returning fake/corrupted content

I recently had to employ such a strategy against some extremely aggressive card testers (criminals with lists of stolen credit cards who automate stuffing card info into a donation form to test which cards are still working). Instead of blocking their IPs, I started feeding them randomly generated false responses with a statistically accurate "success" rate. They ran tens of thousands of card tests over many days, and 99% of the data they collected was bogus. It amuses me to know that I polluted their data and wasted so much of their time and effort. Jerks.
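A tiny sketch of that trick, assuming the fraud check lives in application code and you know your legitimate approval rate (the numbers and field names below are invented):

    import random

    FAKE_SUCCESS_RATE = 0.12   # invented figure: share of tests that would genuinely succeed

    def fake_charge_response():
        # Plausible but entirely fabricated payment-gateway result.
        if random.random() < FAKE_SUCCESS_RATE:
            return {"status": "succeeded", "id": "ch_%024x" % random.randrange(16 ** 24)}
        return {"status": "card_declined",
                "decline_code": random.choice(["insufficient_funds", "do_not_honor", "generic_decline"])}

    # In the donation endpoint: if the client IP is flagged as a card tester,
    # skip the real gateway entirely and return fake_charge_response() instead.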


This warms my heart and it's a great example of lengthening the feedback loop.


I love it. Also add some randomness - there is nothing more frustrating than a problem that only reproduces sometimes!


Excellent idea!


My networking knowledge isn't great, so apologies if this is wrong. But if it's not wrong, it could help.

FIND THE IP FOR THE DOMAIN

  PS > ping sukuns.us.to
  Pinging sukuns.us.to [45.86.61.166] with 32 bytes of data:
  Reply from 45.86.61.166: bytes=32 time=319ms TTL=39
  ...
REVERSE DNS TO FIND HOST

  https://dnschecker.org/ip-whois-lookup.php?query=45.86.61.166
Apparently it's "Dedipath".

And that WHOIS lookup gives an abuse email address:

  "Abuse contact for '45.86.60.0 - 45.86.61.255' is 'abuse@dedipath.com'"
So you could try emailing that address. They may take the site down, or hopefully more than that...


This is not a bad idea, though I would guess that if these guys change IPs, it will be annoying to spend your time sending emails, etc. But then I thought: why not automate this with some simple scripts? You have already outlined your recipe, so simply automate the steps... But the more I thought about the automation around this, you need to be careful not to turn into a "spammer" of sorts, constantly sending emails... Certainly, you would be sending legitimate emails, but if they change their IPs more often, that might trigger your automation more often, somewhat turning you into a mild "spammer", right? :-) I'm not suggesting you abandon your approach, but simply to remember not to overdo it with the scale of emails sent out. ;-)


Aha, some more good ideas there! But you're right, there's tradeoffs and dependencies and uncertainties throughout, so it's not easy to even guess in advance what would work or be worthwhile. Plus as you say there could be negative consequences from a kind of arms-race, with the solution becoming a problem in itself.

It's not the same thing, but I'm reminded now of email in the past, when you would usually get an undeliverable message if something went wrong. But later that was almost entirely stopped - because of spam. Massive volumes of spam were sent from forged addresses, and much of it led to those replies. So that made things worse by doubling the volume, plus the innocents whose addresses had been forged got deluges of confusing undeliverable messages!

I think you're right in that changing IPs would be easy for them. But, changing hosts would be significantly more work and hassle. So if the abuse reporting worked, that could have much more of an impact...


Block all of the prefixes that their AS announces too: https://bgp.tools/as/35913#prefixes


Abuse contacts never work. I've never had any success hounding them about malicious sites they host.


I have almost no experience of this, and nothing recent, so I don't know. But I'm not surprised at what you say, given the amount of abusive stuff that happens online nowadays.


It actually works very well when it's combined with a DMCA takedown request.


Ah yes... Those stories we read where even a hint of a DMCA request results in a takedown, so that the host avoids the legal risk. That could be extremely effective in this case?!


They are probably using some public cloud service so simply banning all IPs from cloud ASNs [1] will usually be enough. Downside is you're also banning any users using VPNs

[1] https://github.com/brianhama/bad-asn-list


Another resource that can be used to check for abusive client IPs is https://github.com/firehol/firehol


Thanks, that seems like something I could work on if I can't find a better solution. Cheers.


Warning: Don't visit the proxy mirror at work, I was redirected to xcams/adult content.


The mirror is injecting their own ads. My guess is it was just a malicious ad forcing the redirect. It could still happen, but it doesn’t appear to be the main intention of the mirror site.


weird, wasn't the case here.


If the host (DediPath) is not respecting DMCA notices, one other thing you can do is adding the requester's IP address to every page, eg as a div class. If the responses are live proxied, this will surface the cloner's front-facing IP address, and you can block that (and their ASN) specifically.


To extend on this, I wouldn't use clear text for it. Create an HMAC of the IP and add it somewhere in the page; that makes it harder to realize what's happening and for the adversary to work around it.
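A small sketch of that, assuming a server-side secret; the HMAC itself can't be reversed, so you recover the IP later by recomputing it over the IPs in your own access logs and matching:

    import hashlib
    import hmac

    SECRET = b"rotate-me-regularly"   # placeholder server-side secret

    def ip_tag(ip):
        # Opaque per-IP marker to embed in the page, e.g. <html data-v="...">.
        return hmac.new(SECRET, ip.encode(), hashlib.sha256).hexdigest()[:16]

    def find_ip(tag, candidate_ips):
        # Given a tag spotted on the mirror, find which logged requester produced it.
        for ip in candidate_ips:
            if hmac.compare_digest(ip_tag(ip), tag):
                return ip
        return None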


Oh, I like this idea. Would be pretty easy to automate it by setting up some script scraping the IP revealed on their site, adding it to the block list as they rotate around. Clever.


Wouldn't they be able to do the same preventing you from scraping the site? They may have many IPs to work with, but you may not?


I'm not sure I can understand your advice.


Add a comment (or attribute or JS with a string literal) to your HTML that contains IP address of whoever requested the page. Obscure it somehow so it's not obvious that the HTML contains the IP address. Then check source code of the copy, and you'll see who requested it. You can then go after that IP.

BTW: if they're removing/replacing domain name of your site, try obscuring it with HTML entities. This may dodge simple find'n'replace.


I think it works as follows: assuming the proxy's client-facing IP is different from the one it fetches with, then by inserting the IP it uses to connect to the original server into the HTTP reply (HTML/body code), that IP can be exposed to the OP. However, since he seems to have access logs and seems to understand the proxy requests pretty well, I wonder how much it actually helps.


That's clever - it took me a while to understand it.

Let's say your website shows the IP of whoever visits it in a text box. When you access it, it shows your IP. When the proxy server accesses it, it shows the proxy server's IP. So when you access the proxy site, the proxy site accesses your site, gets a page with its own IP in that text box, and returns that page - with its IP - to you.

The more advanced method is to encrypt the IP and hide it somewhere, for you to decrypt later, get the IP, and blacklist them.


> They are stripping all JS

Are they now?

Add a `visibility: hidden` to random elements on the page, and show them with javascript.

OR

Are they removing _all_ js? Have you checked whether they remove `<body onclick="some javascript code that injects some other code">` ?

You can try to do script injection _into your own site_ to see if their mirroring software is smart enough to deal with all the different xss vectors.

Bonus points: if they remove your <body onhover=>` attribute, add a style like

    body { display: none }
    body[onhover='the js code that they will remove'] { display: block }


Just try some of the polymorphic XSS tricks hackers use to get JavaScript into a page. PortSwigger has a wonderful page with an extensive XSS list.


Look at your traffic logs and see if you can't fingerprint the scraper. Should be relatively easy since they're mirroring your entire site.

Then instead of blocking the fingerprint, poison the data. Introduce errors that are hard to detect. Maybe corrupt the URLs, or use the incorrect description or category. Be creative, but make it kind of shit.

It's easy to work around blocks. Working around poisoned data is much harder.


This... there are definitely aspects of the proxy that they aren't configuring or are unaware of.

i.e. ssl_cipher, http_x_requested_with, http_accept... and the order of all headers supplied... the casing of all headers supplied... the TLS ClientHello.

It is relatively easy, if you have enough signals, to essentially create a fingerprint whose workings they won't understand, yet which will be effective at blocking them regardless of the IP.

Once you add enough of these together it will be hard for them to get around it without being obvious as they do so.

Super aggressive... those same fingerprints will reveal legit browser traffic and the fingerprints for things like Google-bot... so you could go towards a whitelist rather than blocklist. But this is a place you'd have to actively manage as new variations arise constantly.


This is some really cool anti-scraping inside baseball. Is it safe to say that Cloudflare uses these techniques for weeding out bots?


It's safe to say that if you have enough signals from every possible layer (of which the above are barely a few), it becomes trivial to build a model that can identify the majority of bots.

However, then you're left with the really hard problem of when real browsers are used. But hey, you went a long way before you had to actually look at traffic patterns and in the meantime you've significantly raised the costs for those operating the bots.

It's also worth noting that if you really do have enough signals, bot writers cannot control them all. Everyone can rewrite an HTTP header, but can you pick the right HTTP headers in the right order with the right TLS cipher and TLS ClientHello to appear the same as Chrome on Windows? Good luck.


HN probably won't like this but if they are blocking all JS you can make all content invisible with CSS and use JS to unhide it before page load finishes. Temporarily of course until these guys go away.

The nice thing about this is it can be made arbitrarily complex. For example you can make the page actually blank and fetch all the normal, real content with JS after validating the user's browser as much as you like on both client and server. That's what Cloudflare's bot shield stuff does. Since JS is Turing complete there is no shortcut that the proxy can take to avoid running your real JS if you obfuscate it enough. They would have to solve the halting problem.

What a determined adversary would do is run your code in a sandbox that spoofs the URL, so then your job becomes detecting the sandbox. But it's unlikely they would escalate to this point when there are so many other sites on the internet to copy.


HN contains multitudes, I love this response.

At the very least you collect info about their sophistication level; will they adapt to adversity, or will they bail/move on?


I say that because I know there are a lot of people on HN who browse with JS off and rail against sites that require it. But sometimes you need it.


I very much would leave the website if I'm opening the site for the first time and it doesn't even render partially, but I recognise people like me are the minority of potential visitors.

Unless you have a tiny target audience that includes people who tend to disable Javascript, this seems like a fine solution. I use JS to hide email addresses from the most basic of scrapers on most of my sites, and so far it has worked wonders.

However, long term this seems like a solution that will be difficult to maintain. Either the proxy will start stripping CSS as well (s/hidden/visible) or it'll require constantly playing along in a cat-and-mouse game that you don't stand to gain anything in by winning.

I would add random, page-like endpoints that only you know and request them through their proxy (through VPNs/Tor/you name it). Ban the entire /48 (IPv6) or /24 (IPv4), or send them into a tar pit (iptables -A INPUT -p tcp -m tcp --dport 80 -j TARPIT for the IP addresses you target) to exhaust their resources.


I personally like to use JavaScript on websites and am indifferent to the no-JS movement. I think the assumption of HN being no-JS is because of a vocal few.


Some "less technical" suggestions after being inspired by other creative suggestions here:

Put brandings/personalizations/signatures in your pages that are not easy to remove automatically. Include your site URL if possible. The idea is that if a visitor sees these on a different site, it becomes obvious that the content doesn't belong there.

Write an article page about these things happening, specifically mentioning the mirroring site URLs, and see if they will also blindly mirror it.


Can you use the fact they they're proxying to prove to Bing and Google webmaster tools that you own their domain, and delist it? The verification is done by serving a file provided by Bing/Google.


If they're proxying /.well-known/acme-challenge/, you should be able to get a TLS certificate in their name through Lets Encrypt.


I used to do this to other websites (we won't go into why) - one thing that may help you is to always return your HTML responses gzipped, regardless of whether the client asked for them or not, so ignore Accept-Encoding preferences. This makes it harder for their server to rewrite your URLs on demand, and most clients will accept gzipped responses.
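A sketch of that trick as a Flask after-request hook (the framework choice is just for illustration); it deliberately ignores Accept-Encoding, which real browsers tolerate but a naive rewriting proxy may not:

    import gzip

    from flask import Flask

    app = Flask(__name__)

    @app.after_request
    def force_gzip(response):
        # Compress every HTML response regardless of the Accept-Encoding header,
        # so the proxy has to decompress before it can rewrite any URLs.
        if response.mimetype == "text/html" and response.status_code == 200:
            response.set_data(gzip.compress(response.get_data()))
            response.headers["Content-Encoding"] = "gzip"
            response.headers["Content-Length"] = str(len(response.get_data()))
        return response

    @app.route("/")
    def index():
        return "<html><body>Hello</body></html>"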


We had a similar situation, though it was just a snapshot. We found out they were hosting using S3, and filed a DMCA request with AWS. It was taken down and hasn’t returned.


When someone did it to us we replaced the served content with ads for our site.


That sounds like the most elegant approach so far.


If they are serving all files, that should work for systems that check if you are the owner by asking to serve a file as a response to a challenge.

The copy is using ZeroSSL. This seems to use a similar mechanism to Let's Encrypt to verify certs. Maybe you could get their certificate by serving the response to their challenge from your server. No idea how to proceed from there, though.

Or activating the google webmaster tools. Maybe there's some setting "remove from index" or "upload sitemap" that could reduce its visibility on google.


That's actually a good idea. Apparently it's possible to revoke the certificates via the ACME API, even when you are using another ACME account: https://letsencrypt.org/docs/revoking/#using-a-different-aut...


Don't block their IPs, but rather return them subtly wrong content that isn't broken at the first glance. Insert typos, replace important terms, inject nonsense technobabble, make URLs point to wrong pages, inject off-topic SEO-spammy keywords that search engines will see as the SEO spam they are.
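A toy sketch of that kind of poisoning, applied only to requests you've identified as the mirror's (the word swaps are arbitrary examples):

    import random
    import re

    SWAPS = {"best": "worst", "free": "paid", "alternative": "clone"}   # arbitrary examples

    def poison(text, typo_rate=0.02):
        # Swap key terms and sprinkle character transpositions into otherwise normal text.
        for src, dst in SWAPS.items():
            text = re.sub(r"\b%s\b" % src, dst, text, flags=re.IGNORECASE)
        chars = list(text)
        for i in range(len(chars) - 1):
            if chars[i].isalpha() and chars[i + 1].isalpha() and random.random() < typo_rate:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
        return "".join(chars)

    # Only run this on responses served to the mirror's flagged IPs.
    print(poison("SaaSHub helps you find the best free alternative to any product."))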


If they are stripping all JS, you can make the page work only with JS enabled :/


You can overload their servers, as they pass any URL on to you. Just make sure their resource usage is significantly more than yours, e.g. by serving a gzipped response that contains millions of copies of the same URL to your site. Ideally, make it not fit in their memory. Or simply make it large enough to take really long to compute.

Put it under a URL only you know, then start DoS-ing it.

Of course that requires you to be able to serve a prepared gzipped response; it depends on your stack.


Could you prepend full URLs to your website on assets and trigger CORS requests on all assets? That would make it really annoying to proxy.


Maybe some logic honeypot would be good, such as an infinite paging list with a random trigger hidden on pages with nonsense titles. When an IP hits one of these triggers, it is automatically banned.

Bots will trigger it by walking through all the pages, but a real human would not click in, since the paging and the titles are nonsense.


Yeah, but I don't want to ban bots. Also, they are not actively crawling anything, but rather mirroring the content on demand. At least that's my observation so far... thanks anyways.


This is a game of cat and mouse; although engineering approaches are fun, it's primarily an organizational/legal challenge, not a red/blue team exercise.

The first line of defense is contacting the relevant authorities. This means search engines, the hosting provider, and the owner of the domain (who may not be the abuser). Be polite and provide relevant evidence. Make it easy for them to act on it. There'll be some turnaround time and it's not always successful, but it's the best way to get a meaningful resolution to the issue.

What about in the meantime? If all the source IPs are from one ASN, just temporarily block all IPs originating from that ASN. There'll be some collateral damage, but most of your users won't be affected.


Block everyone else except them and start hosting disney content. Then give the mouse a ring


Does it hurt you in any way? If not, I would just leave it alone. Google can tell a copy from the original. I tried searching some arbitrary text from your website - there is no trace of the copycat in Google's SERP.

What struck me, though, is that the copycat website is waaaay faster than your original. If I were in your shoes, I would invest my time and effort into speeding up the site. Unlike hunting some script kiddies, that will bring palpable benefits.


Bing is directing searches for their service to the fake web site, which is then serving up porn after a few seconds delay. I'd say it is hurting them.


Have you looked at filtering the traffic by ASN? You may be able to identify the provider your adversary likes to use and apply some of the controls musabg suggested to any traffic sourcing from these networks.

I have a website doing this to one of my domains. I have let it slide for now since I get value out of users that use their site too, but I have thought about packing their content with advertisements to turn the tables a bit.


What about steganography?

If you change subtle details about spelling, spacing, formatting, etc by the source IP, then you can look at one of their pages and figure out which IP it was scraped from.

Then, just add goatse to all pages requested by that IP. Alternatively, replace every other sentence with GPT-generated nonsense.

EDIT: it should be quite easy to use JS to fingerprint the scraper. The downside is that you will also block all NoScript users.


The following can be done for free without an API key or Shodan account:

1. Grab the list of IPs that you've already identified and feed them through nrich (https://gitlab.com/shodan-public/nrich): "nrich bad-ips.txt"

2. See if all of the offending IPs share a common open port/ service/ provider/ hostname/ etc. Your regular visitors probably connect from IPs that don't have any open ports exposed to the Internet (or just 7547).

3. If the IPs share a fingerprint then you could lazily enrich client IPs using https://internetdb.shodan.io and block them in near real-time. You could also do the IP enrichment before returning content but then you're adding some latency (<40ms) to every page load which isn't ideal.
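A sketch of the real-time enrichment in step 3, assuming the free InternetDB endpoint behaves as described above (JSON per IP, an error response when nothing is on file):

    import json
    import urllib.error
    import urllib.request

    def internetdb(ip):
        # Shodan InternetDB data for an IP; empty dict if nothing is on file.
        try:
            with urllib.request.urlopen("https://internetdb.shodan.io/%s" % ip, timeout=5) as resp:
                return json.load(resp)
        except urllib.error.HTTPError:
            return {}

    def looks_like_server(ip):
        # Residential visitors rarely expose open ports; hosting boxes usually do.
        return bool(internetdb(ip).get("ports"))

    # Example: decide whether to challenge or block before serving the real page.
    # if looks_like_server(client_ip): serve_challenge()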


Have you considered enabling HSTS on the webserver with dynamic endpoints, and rate limiting requests - literally to a crawl - once an IP is flagged?

I seem to recall someone doing something similar at one point hosting files and setting up resources that get pulled down only on flagged IPs such as a 300kb gzip encoded file that tries to expand to 100TB.


Maybe they're also proxying URLs like the HTML verification files that search engines have you upload to claim the domain as your own?

You may be able to claim their domain out from under them and then mess with search settings (e.g. In Google Search Console you can remove URLs from search results).


Add the worst content imaginable to the page, but don't make it visible by default. If the site strips JS, then use CSS to only show the terrible content when it's shown on that domain. You can use css to check the current domain based on e.g. links.

Extra points if you can cause legal trouble for whoever runs the site. If you're hosting rather large files, then you can also hide content by default that will never be loaded on your site, but will load on the other site. Add a large file to your site, then reference that file a few thousand times with query params to ensure cache busting, and then make the browser load it all using CSS when it detects that it runs on the other site.


Other posts mentioned ways to detect, like:

  - "bait urls" that other crawlers won't touch
  - trigger by request volume and filter out legit crawlers
  - find something unique about the headers in their requests that you can identify them with
One additional suggestion is to not block them, but rather, serve up different content to them. Like have a big pool of fake pages and randomly return that content. If they get a 200/OK and some content they are less likely to check that anything is wrong.

Another idea is to serve them something that you can then report as some type of violation to Google, or something (think SafeSearch) that gets their site filtered.


Redirect via meta tag

<meta http-equiv="refresh" content="time; URL=new_url" />


won't survive a grep


Still, it's never effortless for them - they need to do a small amount of work. A shot across the bows, so to speak.


add some CSS to mess with their URLs

a[href*="sukuns"] { font-size: 500px!important; color: LimeGreen!important; }

pretty much destroys the page. i guess eventually they would give up in the specificity battle.

probably more stuff you could do with CSS to mess with them.


Use them as a free CDN? A page served from your domain could actually download content through them, but with your ads. (Maybe for the continents which are not your primary market.)

(Less economical if they're not caching anything.)


Facebook had multiple approaches to keep users seeing ads (i.e. your message) on their site despite ad blockers. Could you mix a message in amongst your content, broken up across elements? Hopefully it would not affect rankings too much, but it could at least reach users. https://www.bbc.co.uk/news/technology-46508234.amp

Base64 encoding images with watermarks may also be worth a shout.

Love the zip bombing.

Long shot, but I wonder if it's possible to execute some script on their server.


There are ways you can fix this yourself but like all things it's way easier to just get a managed solution. CloudFlare or similar should give the necessary tools to block these types of sites.


.us.to subdomains are sourced from (dynamic) dns provider, FreeDNS: https://freedns.afraid.org/


Want to have some fun?

Happened to me back in the days of blogging.

Posted an image of me mocking them on my blog. Sure enough they published it and they didn't notice for a while. They stopped it soon after :)


You might be able to do an origin filter on the headers for requests to your backend (https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-...). Check that the X-Forwarded-For header is what you expect and, if not, block the request at the middleware level.
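
A small sketch of that header check; it assumes your own edge normally produces a single-hop X-Forwarded-For, which you'd want to verify for your setup before blocking on it:

    from typing import Optional

    def xff_looks_proxied(xff_header: Optional[str], max_hops: int = 1) -> bool:
        """Flag requests whose X-Forwarded-For chain has more hops than your edge normally adds."""
        if not xff_header:
            return False
        hops = [h.strip() for h in xff_header.split(",") if h.strip()]
        return len(hops) > max_hops

    # in your middleware (framework-agnostic usage):
    #   if xff_looks_proxied(request.headers.get("X-Forwarded-For")):
    #       return 403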


Block all requests having "https://sukuns.us.to" as "Referer" HTTP header.


Requests are proxied so the proxy can rewrite the Referer HTTP header at will, AFAIK.


It looks like they're also downloading images directly from your domain, I see https://www.saashub.com/images/app/service_logos/129/k2q4pxz... for example in my debugger.

Edit: you could maybe add a <meta> tag to define a CSP, but I guess they will remove it [1].

[1] https://developer.mozilla.org/en-US/docs/Web/HTTP/CSP


Sending a mail to the hosting provider is helpful. Also, if you are looking at blocking IPs, you can try blocking an entire ASN temporarily to see how that works. It's one thing for someone to destroy a server and reimage it on the same service; it's another thing to destroy it and bring it up on a fresh provider. Currently the attacker is using DediPath, for example. Block the ASN while waiting for their abuse team to respond.
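
If you want to script that ASN-wide block, one possibility is to pull the announced prefixes and turn them into deny rules. A sketch only; the RIPEstat endpoint and its response shape below are my assumption of that public API, so verify before relying on it:

    import requests

    def asn_deny_rules(asn: str = "AS35913") -> list[str]:
        """Fetch announced prefixes for an ASN and emit nginx 'deny' lines."""
        url = "https://stat.ripe.net/data/announced-prefixes/data.json"
        data = requests.get(url, params={"resource": asn}, timeout=10).json()
        prefixes = [p["prefix"] for p in data["data"]["prefixes"]]
        return [f"deny {prefix};" for prefix in prefixes]

    if __name__ == "__main__":
        # paste the output into an nginx server block (or an include file)
        print("\n".join(asn_deny_rules()))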


You may be able to identify those requests by inspecting the TLS cipher. Cloudflare Workers exposes that value in `request.cf.tlsCipher` [0]. Keep in mind the collateral damage it may have, though.

[0] https://developers.cloudflare.com/workers/runtime-apis/reque...


I just checked the offending site - it’s full of malware. I think if you report that aspect then you might get faster resolution from the search engines.


Quick and easy first:

1) Add a watermark to your images when the request comes through their proxy:

Stolen image from {url}

2) Add a JS script that displays a message and redirects when the URL differs from yours


When you detect their IP:

  - serve images upside down
  - serve images blurry

More examples here, from long long ago: http://www.ex-parrot.com/~pete/upside-down-ternet.html


Um, so I clicked the second link, and was redirected to a not-safe-for-work website.

Luckily, I am at home, and my children are at school.

I have no idea what happened, or why I got redirected, but I can certainly suggest not taking up the idea to serve disgusting content (given I clicked a link that someone on HN posted, I shouldn't be subjected to that).


Fail2ban might help

Even with IP rotation, a proxy website would probably generate more traffic than normal from these few IPs. Tweak the fail2ban variables to make it less likely to trigger on false positives (a larger number of requests over a larger window of time), but block the violating IPs for a long period, a few days for example.

I hope it helps


One thing you can do is add a canonical link (<link rel="canonical" href="...">) to each page, which will help solve the Bing/Google issue until they realise it's there. Do it before they add one.

You're already using Cloudflare, you could try talking to their support or just turning up settings to make it more strict for bots.


*.us.to are FreeDNS subdomains, I would contact them. Additionally you could do a whois and contact the ISP.


Lots of good suggestions here, let me throw one more in the pot -- could you do an equivalent of a "ghost ban"?

Instead of blocking their IPs, detect if the traffic is coming from the abuser's IPs, and serve different content -- blank, irrelevant, offensive, copyright violations, etc.


This will change if they switch hosting, but here's a list of all the IP prefixes for their current hosting provider.

https://bgp.he.net/AS35913#_prefixes

The IPs they switch between may all be from this pool.


Kill their ad ranking by inserting explicit words that only porn sites use into the content body (e.g. grab the categories or some video titles from Pornhub). Google only shows results from adult sites when the search string also has explicit words in it.


Lots of great ideas here. A slight variation or emphasis on some: Specifically aim to advertise your own site on the other one. While you can anyway. Free advertising to their (should be your) audience, in return for what they're doing... Seems fair!


There are infinite mitigations, and it will always boil down to how much they want to do this vs. how much you want to prevent them. In the end they could render in a remote-controlled browser and use CDN or AWS IP addresses en masse. I would consider hijacking their users in subtle ways, like replacing pictures or text with obscenities or legal disclaimers; unfortunately their motivation is selling ads to other dodgy companies, so it's unlikely you can mitigate that way. I would also invest in getting the SEO in order and having them removed from Google if possible. Lastly, there are solutions like Cloudflare Turnstile that don't impact normal users as much as the captchas of old did.


Maybe OP only needs to do enough to undermine their website, rather than drive them away.

It's possible the combination of blocked image hotlinks, watermarking the domain inside the images, and CSS trickery that messes up the page on the proxy (along with whatever other steps can be thought of to make it look wrong or erroneous on the proxied site) could get OP bumped to #1 in search on enough links that it no longer matters.

Given the other site isn't generating original content, it's unlikely to ever get its Google juice back.

On a side note: does Google have an option for this? I'm sure they must have encountered it before, and blocking obviously fudged content helps the quality of their results too.


I would try Google's phishing report (as others here have suggested and already done): https://safebrowsing.google.com/safebrowsing/report_phish/ - even if the example here is not aimed at stealing user data per se.


Believe it or not ICANN actually takes abuse reports seriously:

https://www.icann.org/resources/pages/abuse-2014-01-29-en


Dynamically generate all the content in the browser to a canvas element. No HTML to steal.

More simply, you could just make all the HTML links broken unless some obfuscated or server-backed algorithm is run on them. Think Google search results.


Had the same problem. They used a scraper which runs on Amazon AWS... so I blocked all Amazon AWS IPs (google for the list of IPs... and then for a script which creates NGINX rules for all of them). Works quite well.
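
For what it's worth, AWS publishes its ranges at https://ip-ranges.amazonaws.com/ip-ranges.json, so the "script which creates NGINX rules" part can be a few lines. A rough sketch (note that blocking all of AWS will also catch legitimate bots and users behind VPNs):

    import requests

    ranges = requests.get("https://ip-ranges.amazonaws.com/ip-ranges.json", timeout=10).json()
    with open("deny_aws.conf", "w") as f:
        for p in ranges["prefixes"]:            # IPv4 prefixes; "ipv6_prefixes" holds the rest
            f.write(f"deny {p['ip_prefix']};\n")

    # then: include /etc/nginx/deny_aws.conf; inside your server block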


Depending on the nature of the access patterns, you might be able to block the proxy IPs automatically by tuning the parameters on fail2ban (if you have a server).


Some years ago I let expire my blog domain, only to find out that somebody bought it and was serving a mirror of my content plus scam ads. I reported them to their DNS provider and they were gone in 2-3 weeks.


Instead of rendering server-side, render client-side. If they strip JS, they get nothing. In the JS, check the hostname and, if it matches their hostname, don't render anything.

Potential downsides: SEO.


If direct (non-proxied) access from the search engine spiders can be identified, serve the real robots.txt; otherwise serve one that disables crawling. Also, switch a meta noindex tag in the same way.
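
One hedged way to tell real Googlebot/Bingbot apart from the proxy is the documented reverse-DNS check; a rough sketch (the forward-confirmation step matters, since a PTR record alone can be spoofed):

    import socket

    CRAWLER_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

    def is_verified_crawler(ip: str) -> bool:
        """Reverse-resolve the IP, check the crawler domain, then forward-resolve to confirm."""
        try:
            host = socket.gethostbyaddr(ip)[0]
            if not host.endswith(CRAWLER_SUFFIXES):
                return False
            return ip in socket.gethostbyname_ex(host)[2]
        except (socket.herror, socket.gaierror):
            return False

    # Serve the real robots.txt only when is_verified_crawler(client_ip) is True;
    # everyone else (including the proxy) gets a "User-agent: *\nDisallow: /" version.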


What about blocking by TLS fingerprinting (e.g. JA3)? Established browsers have known fingerprints derived from how the TLS handshake is made, the supported connection options, etc.


Anyone have any idea what this does? It's embedded in the copycat site's source: earlierindians.com/39faf03aa687eeefffbe787537b56e15/invoke.js



Cool bot detection!


I'm not going to focus on the problem here. I just want to say that I like the idea behind your website, a good source of market research. Bookmarked it.


Once upon a time, I served all the static content base64-xored with a session key on the backend, and decrypted the content on the front-end with JS.
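
A toy sketch of the server-side half of that scheme (obfuscation rather than real encryption; the matching XOR/base64 decode in front-end JS is left out):

    import base64
    from itertools import cycle

    def xor_b64_encode(content: bytes, session_key: bytes) -> str:
        """XOR the payload with a repeating session key, then base64 it for embedding in the page."""
        xored = bytes(b ^ k for b, k in zip(content, cycle(session_key)))
        return base64.b64encode(xored).decode("ascii")

    # e.g. xor_b64_encode(b"<article>...</article>", b"per-session-key")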


I wonder what they get out of this. Injecting ads perhaps?


I saw another comment saying that the site sometimes redirects to adult content sites.


Landing place for a phishing campaign. See: watering hole attack.


Yep, they inject ads


If they copied the JavaScript too, you could add code that checks if the domain matches and, if not, shows a blank page (or something like that)


You could maybe put your website behind a captcha? Google's reCAPTCHA works behind the scenes, so it won't affect normal users.


Just block all of the OVH IP ranges?

https://ipinfo.io/AS16276


What’s the motivation for someone to proxy mirror a site?

Does copied content even rank in Google? How are they driving the traffic to it?


Motivation is money, the proxy site is serving their own ads.

According to OP, it ranks pretty well on Bing.



Another option would be to use a service like Cloudflare, which offers protections against scraping and other malicious behavior. This can help prevent the proxy-mirror site from being able to access your site's content.

https://blog.cloudflare.com/introducing-scrapeshield-discove...


I have Cloudflare at the front already. The issue is that they are not actively scraping the content but rather mirroring it on demand.


Enable bot protection in Cloudflare as per your plan https://developers.cloudflare.com/bots/get-started/


Just an idea, but maybe you could cause big loads on their servers by requesting a large number of URLs in parallel, for which you actively serve a massive gzipped HTML file that is full of links back to your website.

EDIT: or, building on what user zhouyisu says above, you can generate a perfect-match IP blacklist by calling URLs via the abusing site that automatically put any caller onto the blacklist.


I'm not sure how easy it would be to serve a "zip bomb" without getting into trouble, but it would be neat
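
For the curious, the usual trick is to stream-compress a long run of zeros into a small gzip body and serve it with a Content-Encoding: gzip header. A sketch only, not a recommendation; well-behaved clients cap decompression, and serving this to the wrong visitor is on you:

    import zlib

    def build_gzip_bomb(path: str = "bomb.gz", gib: int = 10) -> None:
        """Write a small .gz file that inflates to roughly `gib` GiB of zero bytes."""
        comp = zlib.compressobj(9, zlib.DEFLATED, 31)  # wbits=31 selects the gzip container
        chunk = b"\x00" * (1024 * 1024)
        with open(path, "wb") as f:
            for _ in range(gib * 1024):
                f.write(comp.compress(chunk))
            f.write(comp.flush())

    # serve the file as-is with headers: Content-Encoding: gzip, Content-Type: text/html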


If you're on Cloudflare and they're stripping JS wouldn't a JS challenge be appropriate?

https://developers.cloudflare.com/fundamentals/get-started/c...


Pretty anti-user though


Because it requires JS? How about this then...

    a[href*= "sukuns.us.to"] {
     display:none; 
    }
Then use SRI to enforce that CSS.


Then where they're coming from should be exceedingly visible in logs (via splunk or whatever), so deny those requests.


Generate your pages from Javascript.


This made me wonder whether something similar was happening to my domain?

How would one go about finding out?


Maybe as simple as a Google / Bing (Yandex / Baidu?) search using some text likely to be unique to your site ("enclose the text in quotes" like that).

If your site had been copied but wasn't indexed by the big search engines, that would presumably limit the options to make money from it?


I can't see your site. The mirror/proxy is useful and doing its job.

Access denied - Error code 1020

You do not have access to www.saashub.com.

The site owner may have set restrictions that prevent you from accessing the site.

Performance & security by Cloudflare


DMCA takedown?

They're serving your copyrighted content. Seems like what it was made for.


Their domain is likely to be considered hostile, and drop in Google results.


What an idea. I will do the same with some popular websites.


Implement a reverse swearword filter for their IP.


If you have a trademark, domain takedown always works.


That isn't necessarily true. While I definitely agree that the OP should pursue that path, many providers are in other jurisdictions.


> 3) Add some JavaScript to redirect to the original domain-name - They are stripping all JS.

Make your site only work with JS. Easy.


I tried to look up their site then realized I block "us.to" locally. Since you have their site linked in this thread they are likely seeing the HN thread as a referrer in their access logs and reading this. I expect this to turn into an ongoing battle as a result, but maybe this could be a fun learning exercise for everyone here.

The current IP 45.86.61.166 is likely a compromised host [1] which tells me you are dealing with one of the gangs that create watering holes for phishing attacks and plan to use your content to lure people in. They probably have several thousand compromised hosts to play with. Since others mentioned you could change the content on your site, I would suggest adding the EICAR string [2] throughout the proxied content as well so that people using anti-malware software might block it. They are probably parking multiple phishing sites on the same compromised hosts [3].

This would also be a game of whack-a-mole but if you can find a bunch of their watering hole sites and get the certificate fingerprints and domains into a text file, give them to ZeroSSL and see if they can mass revoke them. Not many browsers validate this but it might get another set of eyes on the gang abusing their free certs.

If you have a lot of spare time on your hands, you could automate scripting the gathering of the compromised proxy hosts they are using and submit the IP, server name, domain name to the hosting provider with the subject "Host: ${IP}, ${Hostname}, compromised for phishing watering hole attacks". Only do this if you can automate it as many server providers have so many of these complaints they end up in a low priority bucket. Use the abuse@, legal@ and security@ aliases for the hosting company along with whatever they have on their abuse contact page. Send these emails from a domain you do not care about as it will get flagged as spam.

Another option would be to draft a very easy to understand email that explains what is occurring and give that to Google and Bing. Even better would be if we could get the eyes of Tavis Ormandy from Google's vulnerability research team to think of ways to break this type of plagiarized content. Perhaps ping him on Twitter and see if he is up to the challenge of solving this in a generalized way to defeat the watering holes.

I can think of a few other things that would trip up their proxies but no point in mentioning it here since the attackers are reading this.

[1] - https://www.shodan.io/host/45.86.61.166

[2] - https://www.eicar.org/download-anti-malware-testfile/

[3] - https://urlscan.io/result/af93fb90-f676-4300-838f-adc5d16b47...


Lol:

> How to delete the test file from your PC

> We understand (from the many emails we receive) that it might be difficult for you to delete the test file from your PC. After all, your scanner believes it is a virus infected file and does not allow you to access it anymore. At this point we must refer to our standard answer concerning support for the test file. We are sorry to tell you that EICAR cannot and will not provide AV scanner specific support. The best source to get such information from is the vendor of the tool which you purchased.

> Please contact the support people of your vendor. They have the required expertise to help you in the usage of the tool. Needless to say that you should have read the user’s manual first before contacting them.


I would just contact Cloudflare via Discord. They will know what your best recourse is.


Well... the only reasonable thing to do is to find a hosting provider that accepts Monero as payment, rent a bare-metal server with IPMI access, encrypt the hard disk with LUKS and VeraCrypt, scan 0.0.0.0/0 for unpatched DNS servers, and start DDoSing the mirror site.



