HoneyPot – I Made a Text Field Only Bots Use – Here's What Happened (github.com/lee101)
160 points by wrdsmsh321 on Nov 19, 2023 | 48 comments



I’m curious whether those who voted for this submission have ever taken a look at their server logs.

Almost every public website on the open Internet receives thousands of HTTP requests similar to the ones in this text file. This is one of several reasons why web application firewalls gained popularity years ago, especially as vulnerability scanners became widespread.

Years ago, when I was employed at a young security startup, a colleague and I spent countless hours analyzing this particular kind of web traffic. Our objective was to develop basic filters for what eventually evolved into an extensive database of malicious signatures. This marked the inception of what is now one of the most widely used firewalls on the market.


I sometimes take a look at the logs, but nowadays there's a lot of noise from "security" companies that scan probably every IP address and every port for known vulnerabilities. And they do it the lazy way, just firing a bunch of URLs at each port that responds: long hexadecimal URLs, WordPress admin endpoints, OAuth endpoints, etc. In the beginning, they even sent emails to tout their services.

We use one of them for ISO certification. Twice a year, we turn on their "vulnerability scanner", which says it tests over x-thousand vulnerabilities, we get a report, and everybody is happy. Only on the first run did it discover a small error in the nginx config. Unfortunately, it is theater.


Good comment, but I'm only replying to guess your name.

Colin? Michelle?


Sam


Altman?


I notice a lot of specific numbers and strings being used and repeated. What do they mean?

Do these injection attacks come from a single source perhaps, which everyone imitated?


Most of this looks like random data designed to detect whether SQL injection is happening without crashing the query (to avoid detection). So the random strings are effectively a token: if it shows up in the response, the injection worked. Similarly for the sleep calls: the attacker would time the response.
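
For the curious, the two probe styles might look roughly like this (illustrative field values, not taken from the submission's logs; the marker bytes match the CONCAT(0x...) strings discussed further down):

    <?php
    // 1) Union-based: plant a random marker and grep the response for it.
    //    0x71626a7a71 = "qbjzq", 0x716b767071 = "qkvpq" in ASCII.
    $marker_probe = "x' UNION ALL SELECT CONCAT(0x71626a7a71, version(), 0x716b767071)-- -";

    // 2) Time-based blind: nothing is echoed back, so measure latency instead.
    $timing_probe = "x' AND SLEEP(5)-- -";

    // If "qbjzq<...>qkvpq" appears in the response, or the page reliably
    // takes ~5 s longer to load, the field is injectable.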


This exactly. Any text field is "checked" to see if it is getting submitted unprotected to a database. Every single one of them.

When you run a search engine, you will see queries that try to look through every page of results for the search query 'input type="text"'. Typically these come either through an API or to a search page that is fronting another index.


The sleeping is pretty clever, but presumably vulnerable to false positives if the queries are slow anyway. I wonder if they go the extra mile and compare load times with and without the attempted injection.


It would be interesting to make a SQL injection honeypot that behaves like a database in most responses but is designed to maximally frustrate the attacker.


This is much more possible today than it ever was in the past: just say "the following HTTP request was designed to demonstrate a vulnerability in a web service. Please explain what vulnerability this request is designed to detect, and what part of the response demonstrates the vulnerability. Finally, output an example of a response that a vulnerable service might produce in response to this request" to an instruction-tuned LLM, and then return that response to the attacker. (The "explain what is happening" bit is just to get a more plausible response.)

As a bonus, your apparently vulnerable service would be incredibly slow, so any iterative testing would crawl.
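
A rough sketch of how that could look, assuming OpenAI's chat completions endpoint (the model name and prompt wording are placeholders):

    <?php
    // Hypothetical LLM-backed honeypot endpoint: forward the raw request to
    // an LLM and return whatever "vulnerable-looking" output it invents.
    $raw = file_get_contents('php://input') ?: ($_SERVER['QUERY_STRING'] ?? '');

    $prompt = "The following HTTP request was designed to demonstrate a "
            . "vulnerability in a web service. Explain what vulnerability it "
            . "is designed to detect, then output an example of a response "
            . "that a vulnerable service might produce:\n\n" . $raw;

    $ch = curl_init('https://api.openai.com/v1/chat/completions');
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HTTPHEADER => [
            'Content-Type: application/json',
            'Authorization: Bearer ' . getenv('OPENAI_API_KEY'),
        ],
        CURLOPT_POSTFIELDS => json_encode([
            'model'    => 'gpt-4o-mini', // placeholder model name
            'messages' => [['role' => 'user', 'content' => $prompt]],
        ]),
    ]);
    $reply = json_decode(curl_exec($ch), true);

    // Slowness is a feature here, not a bug.
    echo $reply['choices'][0]['message']['content'] ?? '';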


I feel like that's going to be quite expensive as a honeypot: running an LLM against all the script kiddies out there.


I wonder if there's enough repetition to get wins by caching.

(Granted, it's probably still too much overhead, just a thought.)


Reminds me of that classic riddle.

You come to a fork in the road. There are three statues; you need to ask one question and figure out which way is the right way to go.

One statue always lies, one always tells the truth, and one kills people who ask convoluted questions.


Yes, a lot of tools do, including w3af:

https://github.com/andresriancho/w3af/blob/fb345a5/w3af/core...

This one sends the payload reversed as a control, to check whether the delay is really due to the SQLi attempt.


It would be a better idea to use the same query with a variable wait and see if there's a linear correlation in elapsed time.
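
Something like this, maybe (target URL and parameter are made up; a real scanner would repeat each measurement and model the baseline):

    <?php
    // Sketch of the variable-wait check: inject SLEEP(n) for several n and
    // see whether latency grows linearly with n.
    $base = 'https://target.example/search?q=';

    $times = [];
    foreach ([0, 2, 4, 6] as $n) {
        $start = microtime(true);
        @file_get_contents($base . urlencode("x' AND SLEEP($n)-- -"));
        $times[$n] = microtime(true) - $start;
    }

    // Each extra 2 s of SLEEP should add roughly 2 s of latency; a slow but
    // uninjectable endpoint adds a constant offset instead.
    $prev = null;
    $linear = true;
    foreach ($times as $t) {
        if ($prev !== null && abs(($t - $prev) - 2.0) > 0.5) $linear = false;
        $prev = $t;
    }
    echo $linear ? "time-based SQLi likely\n" : "inconclusive\n";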


That's most likely from a single person who used sqlmap to test for SQL injection. I haven't seen internet-wide attempts at testing SQL injection.


Our WAF logs are fun reading. We see so much traffic from bots looking for PHP files and posting to inputs.


Yeah, so much noise. I enjoy screwing around with them in my free time, "imposing cost" by giving back unexpected things. I don't know if it actually accomplishes anything, but I bet returning either a gzip bomb or a 5 MiB really obscure (but valid) HTML file will crash quite a few scanners.

https://nitter.net/gnyman/status/1181652421841436672
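
For what it's worth, the gzip-bomb half is only a few lines of PHP (sizes arbitrary; whether a given scanner actually chokes on it is anyone's guess):

    <?php
    // Serve a small gzip stream that inflates to ~1 GiB of zeros client-side.
    header('Content-Encoding: gzip');
    header('Content-Type: text/html');

    // Compress incrementally; gzencode() on 1 GiB at once would exhaust memory.
    $ctx = deflate_init(ZLIB_ENCODING_GZIP, ['level' => 9]);
    $chunk = str_repeat("\0", 1 << 20);      // 1 MiB of zeros
    for ($i = 0; $i < 1024; $i++) {          // 1024 x 1 MiB = 1 GiB inflated
        echo deflate_add($ctx, $chunk, ZLIB_NO_FLUSH);
    }
    echo deflate_add($ctx, '', ZLIB_FINISH);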


Are you familiar with OpenBSD tarpitting?


Not OpenBSD specifically, but the concept, yes; I've played with it too:

https://nyman.re/super-simple-ssh-tarpit/
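
A toy version of the same idea (port and delay arbitrary; endlessh does this properly with a multiplexed event loop). SSH clients read banner lines until one starts with "SSH-", so an endless stream of other lines keeps them hanging:

    <?php
    // Minimal SSH tarpit sketch: drip junk banner lines forever.
    $server = stream_socket_server('tcp://0.0.0.0:2222', $errno, $errstr);
    if (!$server) die("$errstr ($errno)\n");

    while ($conn = @stream_socket_accept($server, -1)) {
        // One client at a time keeps the sketch short; a real tarpit
        // would juggle many sockets with stream_select().
        while (@fwrite($conn, sprintf("%08x\r\n", random_int(0, 0xFFFFFFFF))) !== false) {
            sleep(10);
        }
        fclose($conn);
    }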


All of these look like things sqlmap tries; this could all be just one person who tried to target the server.


I did something similar about 15 years ago. They weren't as complex back then, and there was more effort to obfuscate them.

Mostly they came from Israeli, Chinese, and Russian IP addresses.


I ran a public-facing web server — you won’t believe what happened next!


Protip: I usually add a hidden input field to my forms. As it is hidden, a normal user should not be able to fill it out; only a bot will. So if the hidden input isn't empty, I can disregard the submission as spam. It works wonders.
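
A minimal sketch, with a hypothetical field name (bots love fields called "website"):

    <?php
    // The form contains an input humans never see:
    //
    //   <input type="text" name="website" value="" tabindex="-1"
    //          autocomplete="off" style="position:absolute; left:-9999px">
    //
    // Server side, any non-empty value flags the submission as a bot.
    if ($_SERVER['REQUEST_METHOD'] === 'POST') {
        if (!empty($_POST['website'])) {
            http_response_code(200); // pretend success so the bot moves on
            exit;
        }
        // ...handle the legitimate submission...
    }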


I do the opposite: a hidden field that is filled automatically by JavaScript. Yes, that means you can't submit without JS. That's a tradeoff I was ready to make. I'm actually surprised it still works as well as it does.
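
Presumably something like this (the token scheme is a guess; a real one should be per-session and expiring):

    <?php
    // Inverse honeypot sketch: the page's JS must copy a token into a
    // hidden field; non-JS bots submit it empty.
    session_start();
    $_SESSION['form_token'] ??= bin2hex(random_bytes(8));

    if ($_SERVER['REQUEST_METHOD'] === 'POST') {
        if (($_POST['js_check'] ?? '') !== $_SESSION['form_token']) {
            exit('spam');
        }
        // ...handle the legitimate submission...
    }
    ?>
    <form method="post">
      <input type="hidden" name="js_check" value="">
      <textarea name="message"></textarea>
      <button>Send</button>
    </form>
    <script>
      // Only executed by real browsers (or bots running a JS engine).
      document.querySelector('[name=js_check]').value =
          <?= json_encode($_SESSION['form_token']) ?>;
    </script>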


What if I don't want Javascrap running in my browser?


You can't use it - that's the trade-off...


For many sites, turning off JS is not an option. IMO, it's wasteful to ignore all that compute power in the browser. It's better to run code in thousands of browsers than do it all on the server.


You mean it's better to have the user pay for that CPU time.


The user already is. The casual user's system is idling with all sorts of nonsense; adding some light processing doesn't hurt. I'm thinking 100-500 ms per page. You don't render the page for the user either, do you?

For heavier use cases (e.g. image processing), the user should be willing to spend some CPU power. It doesn't make sense to send an image to a server, put it in a queue, wait for an image-processing worker to run it, and send the result back. It's simpler and more sensible to run that process client-side, where feasible. LLMs, for example, are too big for that, but many other tasks aren't.


Some "light processing" like 10 seconds of instantiating <js framework of the day> crap that gives me nausea while it redraws infinitely and boxes move around on the page?

Even the mobile oriented samey SAAS sites that have you scroll through 20 screens to read 5 lines sound better...

Edit: Btw, 100-500 ms on what? The latest Intel 500 W space heater? And tested only in Chrome because it's too expensive to notice that it's not very fast or responsive on other browsers?

Edit 2: Not to be misunderstood. If you're doing the computation for me, go ahead. If you're doing the computation because your framework has 100000% overhead, no thanks.


I don't like heavy frameworks either. I try to keep everything light, both server- and client-side. A 10 s loading animation is too much. But having no framework at all severely limits development speed.

All browsers are approximately equally fast nowadays. I use Firefox, so no worries there.


Yes


You are a very small, immeasurable part of the internet that most website owners don't really care about.


Same here. Much better than any CAPTCHA!


One of the first bits of analytics I put on any webserver is to count all unhandled URLs. As others here say, things like WordPress admin-page probing are classic, but I remember one of the Django designers pointing out that sometimes legitimate-looking requests are actually a form of suggestion. That used to be a lot more true when people would play with URLs to get to what they wanted.

Relatedly, if you work in a field where your products become known as a useful benchmark, you will find prototypes start showing up long before any public disclosure. We used to use this to anticipate new screen resolutions and to evaluate new GPUs and SoCs before being told about them.


I remember back in the day getting my first server online. A few months in, I stumbled across the SSH logs... let's say it was quite handy, because at the time we were trying to come up with a name for our kid.

The internet is a jungle with dragons. Nowadays I try to keep everything on my VPN as an extra security layer.


I remember installing a Juniper Intrusion Detection System in a server rack at a telecom company. I was quite impressed when I saw in the logs that such attacks were detected and blocked. This was 15 years ago.


I've got a classic guestbook on an intentionally vintage page, but I actually filter the input into "spam" and "humans". Here's the spambook: https://bootstra386.com/spambook.html

The filter system is open source https://github.com/kristopolous/BOOTSTRA.386/blob/master/hom...

It shows that it's not impossible to have a classic anonymous guestbook; you just have to be a bit clever.

Yes, that's a zipbomb aimed at a particular offender at the beginning. It worked: a script dumb enough to brainlessly slam the site broke easily against it.


Line 44 is like a chronology of spam terms: starting with porn, then viagra, now bitcoin and online gambling.


The simple trick of splitting up the nonce at https://github.com/kristopolous/BOOTSTRA.386/blob/master/hom... seems to be what makes the vast majority of the spam bots faceplant, at least the ones that spam random irrelevant sites like this one.
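
I haven't dug through the linked filter, but a split-nonce check could plausibly look like this (field names invented):

    <?php
    // The nonce is split across two hidden fields; only a client that
    // submits the form exactly as rendered reassembles it correctly.
    session_start();

    if ($_SERVER['REQUEST_METHOD'] === 'POST') {
        $joined = ($_POST['n1'] ?? '') . ($_POST['n2'] ?? '');
        if (!hash_equals($_SESSION['nonce'] ?? '', $joined)) {
            exit('spam'); // or route it to the spambook
        }
        // ...store the guestbook entry...
    }

    $_SESSION['nonce'] = bin2hex(random_bytes(16)); // 32 hex chars
    [$a, $b] = str_split($_SESSION['nonce'], 16);   // two 16-char halves
    ?>
    <form method="post">
      <input type="hidden" name="n1" value="<?= $a ?>">
      <textarea name="message"></textarea>
      <input type="hidden" name="n2" value="<?= $b ?>">
      <button>Sign</button>
    </form>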


if (strstr($_REQUEST['textfield_name'], "UNION")) { block_ip($_SERVER['REMOTE_ADDR']); }


Union-busting management would be proud.


What's with all the CONCAT(0x71626a7a71, ... ,0x716b767071)? As ASCII they're qbjzq ... qkvpq


It boggles my mind that there is software out there where that actually works.


ah, sqlmap


I did the same on a real contact form before to reduce spam... If the field got modified, I knew it was a bot... Worked pretty well



