
Robots.txt as a Security Measure? - hjalle
https://www.cdsrc.com/posts/robots-txt-security-measure/
======
waisbrot
I was hoping this would be about putting an orphan path in your robots.txt and
then black-listing clients who tried to fetch it -- nobody should know about
it except robots who are told not to go there, so anyone who visits the link
is an adversary.

~~~
sixplusone
As one small data point:

I've been running this experiment (see another comment). While bots continuously
hammer on port 22 (ssh) and repeatedly try to fetch things like /wp-* (I don't
even run PHP), they don't bother fetching robots.txt in the first place, and my
honeypot hasn't had a single hit.

Definitely do not try to "secure" your site this way, but bots are either not
sophisticated enough to analyze the .txt, or it might already be a known
technique. Seems many other commenters came up with the same idea.

~~~
mjbmitch
If you're an adversary trying to snoop on port 22, why would you bother to
respect the conventions of robots.txt to begin with?

~~~
sixplusone
Not necessarily the same bot. And they're not snooping so much as brute-
forcing default/common/random(?) usernames & passwords.

------
cobbzilla
"I don't always expose my production database on a public URL, but when I do,
I put a 'Disallow' in my robots.txt for it."

~~~
smacktoward
It reminds me of that scene from _Spaceballs._

"The combination is... 1-2-3-4-5."

"That's amazing! I've got the same combination on my luggage!"

------
tptacek
There's a running joke among web pentesters about robots.txt being the first
place you look when hitting a new site.

~~~
acdha
Meanwhile over in .gov I’ve had to explain to a pentester that it wasn’t a
security problem that robots.txt was accessible without authentication, based
on a very big vendor’s scanner having badly regurgitated the OWASP advice.

~~~
duxup
The "security" world has an unusually high level of total incompetence. It is
scary.

~~~
acdha
This is common any time there’s so much demand: in the late 90s it was not
uncommon to be in a room full of people who were ostensibly web developers and
didn’t understand how the web or their backend servers worked but were certain
they were about to become rich.

Security is especially bad because so many large organizations are under
pressure to improve but the market is tight and the pool of experts is
limited. Also, many places have outsourced to large contracting companies who
don’t want to admit they don’t have enough qualified staff and will hope that
you’ll be satisfied with whoever they deliver.

~~~
duxup
Yeah, no doubt it is a phase.

It's just a really nasty phase right now.

I always think of this:

https://medium.com/@djhoulihan/no-panera-bread-doesnt-take-security-seriously-bf078027f815

------
dTal
Honestly, I think it couldn't hurt, _if done appropriately_. If crawlers are
indexing those pages, then they're publicly available anyway and could be
crawled by a determined attacker, so nothing in robots.txt ought to be truly
sensitive. But if there are pages that _ought_ to be secure but might contain
an exploitable vulnerability, putting their path in robots.txt at least limits
their exposure to those determined enough to look, rather than to any lazy
script kiddie using Google to search your site.

Obviously you shouldn't _rely_ on it, but defense in depth as always.

~~~
acdha
If you want that as an additional safeguard, set the noindex header on that
path at your edge so you’re not calling attention to it:

https://developers.google.com/search/reference/robots_meta_tag
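
A minimal sketch of that header, assuming a Python/Flask app and a hypothetical
path prefix; per the suggestion above, in practice you'd set it at the edge
(CDN or load balancer) rather than in application code:

    from flask import Flask, request

    app = Flask(__name__)

    # Hypothetical prefix for pages that should stay out of search indexes.
    PRIVATE_PREFIX = "/internal-dashboard"

    @app.after_request
    def add_noindex_header(response):
        # X-Robots-Tag is the header form of the robots meta tag: it tells
        # compliant crawlers not to index the response, without advertising
        # the path in robots.txt.
        if request.path.startswith(PRIVATE_PREFIX):
            response.headers["X-Robots-Tag"] = "noindex, nofollow"
        return response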

I’d also strongly recommend pairing this with outside monitoring which alerts
if something accidentally becomes reachable since it’s really easy not to
notice something working from more places than intended.
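
A rough sketch of that kind of outside check, assuming hypothetical URLs and
the third-party requests library; a real setup would run it on a schedule from
outside your network and wire alert() to actual paging:

    import requests  # third-party: pip install requests

    # Hypothetical paths that should never answer without authentication.
    PRIVATE_URLS = ["https://example.com/internal-dashboard/"]

    # Denied, redirected (e.g. to a login page), or missing are all fine;
    # anything else suggests the path has quietly become reachable.
    EXPECTED = {301, 302, 303, 307, 308, 401, 403, 404}

    def alert(message):
        print("ALERT:", message)  # stand-in for whatever alerting you use

    def check_private_urls():
        for url in PRIVATE_URLS:
            resp = requests.get(url, allow_redirects=False, timeout=10)
            if resp.status_code not in EXPECTED:
                alert(f"{url} answered {resp.status_code}")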

------
brlewis
_It would even be a million times better to place the sensitive files inside
/TOP_SECRET_FOLDER and disallow the entire path, avoiding to explicitly name
the paths at least._

This is the only way to use robots.txt for semi-sensitive info, and obviously
not for info so sensitive that it would be awful for it to get out. URLs can
leak through proxy logs and shared browser history.
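
In robots.txt terms, that is the difference between enumerating each sensitive
URL and a single directory rule along the lines of:

    User-agent: *
    Disallow: /TOP_SECRET_FOLDER/

The individual file names then never appear in the file, though the folder
itself is still advertised.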

------
zaarn
Just put a fake /wp/admin/login URL (or similar) in the disallow rules, then
just IP ban everyone trying to access it for 24 hours. That's how you do
robots.txt Security.
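
A minimal sketch of that idea as application-level middleware, assuming
Python/Flask, a hypothetical honeypot path, and an in-memory ban list (a real
setup would block at the firewall or load balancer instead):

    import time
    from flask import Flask, abort, request

    app = Flask(__name__)

    HONEYPOT_PATH = "/wp/admin/login"  # fake path, also listed under Disallow in robots.txt
    BAN_SECONDS = 24 * 60 * 60         # the 24-hour ban suggested above
    banned_until = {}                  # ip -> unix time when the ban expires

    @app.before_request
    def honeypot_ban():
        ip = request.remote_addr
        now = time.time()
        if banned_until.get(ip, 0) > now:
            abort(403)                 # still serving out an earlier ban
        if request.path == HONEYPOT_PATH:
            banned_until[ip] = now + BAN_SECONDS
            abort(403)                 # hitting the honeypot starts the ban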

------
dusted
Why in the world would you put things that shouldn't be downloaded ON THE
INTERNET to begin with? And if you then proceed to also tell the whole wide
world that you did it... it's difficult to feel any empathy.

------
LinuxBender
You can also use robots.txt as a honeypot. Simply add some realistic-looking
URLs to the Disallow pattern and create rules in haproxy, nginx, or your own
custom scripts to catch anyone hitting those URLs and put them in a hamster
wheel, i.e. give them a static "Having problems?" page, or just outright block
them. On my own personal systems, I use "silent-drop" with a stick table in
haproxy.

------
slang800
On a similar note, tools like grab-site
(https://github.com/ArchiveTeam/grab-site) wisely use robots.txt as a method
of finding additional paths to archive when crawling sites.

------
ltcd
Storytime!

At my previous job we were developing an e-commerce system: super-old, big,
messy, zero-test PHP trash. After two years of actively working on it, I still
couldn't form a clear picture of the details of its subsystems in my head.

One day a client calls, saying he is missing many of his orders. The whole
company is on its feet and we're searching for what went wrong. We examine the
server logs and find that someone is making thousands of requests to our admin
section, linearly incrementing the order IDs in the URL. Definitely some kind
of attack. Our servers are managed by a different company, so we open a ticket
to blacklist that IP. A quick search tells me the requests are coming from AWS
servers, and the IP leads me to a GitHub issue on some nginx "bad bots"
blocking plugin, saying that this thing is called the Maui bot and we're not
the first to experience it. Nice. Anyway, this thing is still deleting our
data, and we can't even turn off the servers because of SLAs and how the
system was architected.

So we try to find out how it is even possible that an unauthorized request can
delete our data. We examine our auth module, but everything looks right: if
you're not logged in and visit the order detail page (for example), you're
correctly redirected to the login screen. So how? We read the documentation of
the web framework the application uses. There it is: $this->redirect('login');.
According to the documentation, it was missing a return before that statement.
Without the return, everything after that point was still executed, and
"everything" in our case was the action from the URL. No one ever noticed,
because there were no tests, and when you tried it in the browser you were
"correctly" presented with the login screen. Unfortunately, with a side
effect. The guy who wrote that line did it 5-6 years before this incident and
had been gone from the company for years before I even joined. I don't blame
him.
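
The original was a PHP framework, but the same control-flow mistake is easy to
reproduce in any web framework; here is a hypothetical Flask-flavored sketch of
the bug (in the PHP framework the redirect header was still emitted as a side
effect, which is why the browser behavior looked correct):

    from flask import Flask, redirect, session

    app = Flask(__name__)
    app.secret_key = "dev"             # hypothetical

    ORDERS = {1: "an order"}           # stand-in for the real data store

    @app.route("/admin/orders/<int:order_id>/delete")
    def delete_order(order_id):
        if "user" not in session:
            redirect("/login")         # BUG: the redirect is built but never
                                       # returned, so execution falls through
            # return redirect("/login")  # the missing one-word fix
        ORDERS.pop(order_id, None)     # the destructive "action from the URL"
                                       # still runs for anonymous requests
        return "order deleted"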

Fix. Push. Deploy. No more deleted orders.

POST MORTEM:

The Maui bot went straight to the disallowed /admin paths in robots.txt and
tried to increment numbers (IDs) in paths.

I remember that, because the Maui bot's actions were (to the system)
indistinguishable from normal user actions, someone had to manually fix the
orders in the database using only the server logs and comparing them somehow.

Sorry for my English, and yeah, (obviously) don't use robots.txt as a
security measure of any kind...

------
nydel
i move my not-to-be-indexed stuff around a lot: renaming, archiving, etc. &
i've a bit of shell scripting and a common lisp program that automatically add
things to robots.txt, so almost everything listed there 404s, and the few
paths that don't are protected via htaccess.

not sure why i did this aside from that it was fun!

