
Comment spam random text template - Titanous
https://gist.github.com/shanselman/5422230
======
qeorge
It's a sad moment when you first have to explain to your mother/client/etc that
no, Discount Oakley Sunglasses didn't really enjoy your superb post.

It's a shockingly effective spam tactic. People cannot resist praise.

~~~
petercooper
I must say that you've done an excellent job with commenting on Hacker News.
These are genuinely fascinating ideas and I have spent 3 hours today reading
your comments. If more people left comments like you do then the Internet
would be a better place.

~~~
orangethirty
I really enjoy listening to your podcasts. I always learn something from them.
In the last episode you featured a new project that is very close to one I'm
working on right now. It'd be great if you checked it out, because it has
features the other project doesn't. Features that I think make it better and
easier to use. But enough about that. Any news on the new podcasts?

 _That's how you do it._ (:

~~~
petercooper
Except you're clearly not spamming but actually referring to my work... or are
you? Oh, you win.

~~~
orangethirty
That's why I get paid the big bucks. =P

In all seriousness, I do listen to your podcasts. Proof: You forgot to renew
the domains once, and joked about it.

------
SmileyKeith
I have a friend who works for a large marketing company where they write
website content for companies that looks almost identical to this. The company
hires hundreds of young writers at a time and pays them minimum wage to come
up with a few of these every day. They pass their paragraphs around to other
writers who reword it again so it will show up as a unique site for search
engines. I call it the spam factory.

~~~
Matt_Cutts
What's the company?

~~~
tomjen3
By commenting here you just proved that Google has trouble identifying these
kinds of sites algorithmically.

~~~
dpe82
Or he's just curious who his adversary is. I would be.

~~~
Nick_Ker
Or he is doing his job!

------
DanielBMarkham
I love a good con, and it's instructive to teach people how they are tricked
so that it doesn't happen to them.

One of my favorite spams happened a couple of weeks ago. The spammer said he
had written a very long post about my excellent article, but his computer
crashed, so he wasn't going to repeat it.

This is very similar to "Gee, if you could only have read my comment, you
would have loved how on-topic and awesome it was!"

We've created this system where we are paying people to drop by our blogs and
tell us how awesome we are. Strange world.

------
bcoates
Isn't a trivial templating system like this a very easy thing to train a
Bayesian spam filter on, even without the source template? I guess blogspammers
mostly prey on entirely unmaintained sites for something like this to work.

But I'm going to have a hard time avoiding using "fastidious!" as a general
expression of approval now.

~~~
klodolph
Naïve Bayesian filtering might not work very well on this kind of text. It
basically looks like a regular comment, until you start recognizing that it
always follows the same pattern. Your basic Bayesian classifier throws all
of the words into an unordered bag before analyzing them, which loses all of the
information about patterns and word order. The words are then treated as
"independent", which means that even though the template might generate the
words "pretty worth bloggers content online" every time it uses the first
template, the naïve Bayesian classifier will _never_ figure that part out.

My suspicion is that existing Bayesian classifiers have pushed the spammers to
develop more natural-seeming templates, like this one.
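
To make the order-blindness concrete, here is a minimal Python sketch. The `bag_of_words` helper and the whitespace tokenizer are illustrative assumptions, not how any particular spam filter is implemented.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercase whitespace-separated tokens; word order is discarded."""
    return Counter(text.lower().split())


templated = "pretty worth bloggers content online"
shuffled = "online content bloggers worth pretty"

# Identical counts: a word-level naive Bayes model sees the same features
# for the fixed template phrase and for any reordering of it.
same_features = bag_of_words(templated) == bag_of_words(shuffled)
print(same_features)  # True
```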

~~~
A1kmm
You can run a Bayesian filter over pairs (or even triplets) of words (although
this could cause the probabilities to be a bit off, because pairs like
"although this" and "this could" are not truly independent). The downside is
that as you do this, you drastically increase the size of the model and the
amount of training data required.

Bayesian filters can also take more than just the words in the text into
account - for example, they can take the submitting IP address (or perhaps /24
or ASN) into account, or a spam classification from external sources.

There are certainly better methods that could be built for recognising unknown
templates - a simple known-state Markov model would be sufficient for the
cases where templates substitute one word at a time, and you could conceivably
use an unsupervised learning algorithm to discover an unknown number of models
from a large corpus of comments.
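
As a rough illustration of the word-pair idea, the sketch below extracts n-gram features in Python. The helper name `ngram_counts` and the whitespace tokenizer are assumptions made for the example; a real filter would feed these counts into its own probability model.

```python
from collections import Counter


def ngram_counts(text: str, n: int = 2) -> Counter:
    """Count overlapping n-token windows (hypothetical helper, whitespace tokens)."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


templated = ngram_counts("pretty worth bloggers content online")
shuffled = ngram_counts("online content bloggers worth pretty")

# Unlike single-word counts, the pair features differ once the order changes.
order_is_visible = templated != shuffled
```

The cost mentioned above shows up directly: the number of distinct keys grows much faster with pairs and triplets than with single words.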

------
Titanous
It looks like this factors out to 4,351,250,624 unique comments.

~~~
brador
How did you calculate this?

~~~
A1kmm
The set of strings is isomorphic to a Cartesian product of sets with the same
cardinalities as the sets of options. For example, if the template was "{I,We}
like {HTML, CSS, Javascript}", you can make a set {0,1}x{0,1,2}, where each
element in the product set maps to one string and vice versa: (0,0) might
represent "I like HTML", and (1,2) might represent "We like Javascript".

Because there is a one-to-one map (bijection) between the Cartesian product
set and the set of strings, the size of the product set is the same as the
size of the set of strings.

The cardinality of a Cartesian product AxB, where A and B are sets, is
|AxB|=|A|x|B|, so to find the size of the set of strings, you just need to
multiply together the number of options at each point in the template where
you have a choice.
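
The small example above can be checked by brute force. This is only a sketch in Python; the `choices` list is a hand-written stand-in for the two option sets in the example template.

```python
from itertools import product
from math import prod

# One inner list per choice point in the example template.
choices = [["I", "We"], ["HTML", "CSS", "Javascript"]]

# |A x B| = |A| * |B|: multiply the number of options at each choice point.
count = prod(len(options) for options in choices)

# Enumerate the Cartesian product and build one string per element.
strings = [f"{subject} like {topic}" for subject, topic in product(*choices)]

print(count, len(set(strings)))  # 6 6
```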

~~~
brador
Multiply the count in each set, gotcha. But did he just count it for each set?
It seems like that would be time consuming. Is there some shortcut? Or did he
just write a quick program to do it?

~~~
Titanous
I used
[https://github.com/flintinatux/spintax_parser/blob/c356ebd88...](https://github.com/flintinatux/spintax_parser/blob/c356ebd88a4e6da51dd6bd6d480a00fbe9883809/lib/spintax_parser.rb#L16-L26)
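
For readers without Ruby handy, the counting step can be sketched in a few lines of Python. This is not a port of the linked library: `count_variations` is a hypothetical helper that assumes flat, non-nested `{a|b|c}` groups and treats every alternative as distinct.

```python
import re
from math import prod

# Matches one flat group such as {great|good|excellent}.
GROUP = re.compile(r"\{([^{}]*)\}")


def count_variations(template: str) -> int:
    """Multiply the number of alternatives found in each flat group."""
    return prod(len(m.group(1).split("|")) for m in GROUP.finditer(template))


sample = "Have a {great|good|excellent|fantastic} day! I'm very {happy|glad|pleased}"
print(count_variations(sample))  # 12
```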

------
MitziMoto
It's called "Spintax". People in the internet marketing business use this all
the time. Not just for comment spam, but to generate different versions of
entire articles for submission to article directories and whatnot.

This is one of the better-known "spinners", as they are called:
<http://thebestspinner.com/>

------
manacit
1 Million results on Google for "I have been surfing online more than 2 hours
today, yet I never found any interesting article like yours" [1]

[1]:
[https://www.google.com/search?q=I+have+been+surfing+online+m...](https://www.google.com/search?q=I+have+been+surfing+online+more+than+2+hours+today%2C+yet+I+never+found+any+interesting+article+like+yours)

~~~
clauretano
Since that's only one of 16 combinations, I checked another. 7.7mil results if
you change the 2 to a 3. Well then.

------
Titanous
Here's a quick script that spins the text.

    
    
    gem install spintax_parser

    # URI.open is needed on Ruby 3.0+, where Kernel#open no longer fetches URLs
    ruby -rspintax_parser -ropen-uri -e 'String.send(:include, SpintaxParser); puts URI.open("https://bit.ly/Ziv9Aw").read.gsub("\n", " ").unspin'
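
The spinning step itself is small enough to sketch without the gem. The Python below is an illustrative stand-in, not the gem's implementation: `spin` is a hypothetical function that resolves `{a|b|c}` groups innermost-first, so it also handles nesting, though nested options are not picked with uniform probability.

```python
import random
import re

# Matches a group that contains no further braces, i.e. an innermost group.
INNERMOST = re.compile(r"\{([^{}]*)\}")


def spin(template: str, rng: random.Random) -> str:
    """Replace each innermost group with one random alternative until none remain."""
    while True:
        template, replaced = INNERMOST.subn(
            lambda m: rng.choice(m.group(1).split("|")), template
        )
        if replaced == 0:
            return template


rng = random.Random(0)  # seeded only so that runs are repeatable
print(spin("{I|We} like {HTML|CSS|Javascript}", rng))
```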

~~~
christiangenco
Whoa, `spintax_parser`? ...are there legitimate uses for this?

~~~
hayksaakian
Perhaps generating contents to use with a testing suite?

------
larvaetron
Man... I'd actually get to an ATM, but I took during supervised visitation to
see if they actually had to talk to any judgment. I don't know what is written
so we could draw the appropriate details. I had the Sponge Bob Lego set. Not
only did they change the social worker again - if lowering the lien can be
cancelled or reduced, Windows Vista would not let that get in contact with me.
Will we ever know the rest? I told her my name when I signed in today... and
then cashed it and put six graveyard cards down on them and that trying to get
used to have the present ability to make a deposit. This is the reason it was
too much time as an and the house was not there. Maybe whoever poured stuff on
my answering machine fell... and now they can't even eat a slice without it
tearing and falling to pieces. She is now one of my underwear that is being
hacked into, they probably figured out how much it would cost for the attic
that flew off the bus. It did install when I told her that she was first
soldered, but me... how important a father to a building! I walked there from
2003! Yesterday, I called my lawyer and said that I will clean the apartment
tomorrow.

~~~
GhotiFish
Markov chain?

I'm going to make one when I get home today.

------
ChuckMcM
Oh god that is priceless. In the right hands you could create a regex that
would block all spam from this guy!
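
A rough sketch of how that regex could be derived mechanically from the template, in Python. It assumes flat, non-nested `{a|b|c}` groups and that the spammer posts the spun text unmodified; `spintax_to_regex` is a name made up for the example.

```python
import re


def spintax_to_regex(template: str) -> "re.Pattern":
    """Turn flat {a|b|c} groups into (?:a|b|c) and escape everything else."""
    # With one capturing group, re.split alternates literal text and groups.
    parts = re.split(r"(\{[^{}]*\})", template)
    pieces = []
    for index, part in enumerate(parts):
        if index % 2 == 1:  # odd positions hold the captured {...} groups
            options = part[1:-1].split("|")
            pieces.append("(?:" + "|".join(re.escape(o) for o in options) + ")")
        else:
            pieces.append(re.escape(part))
    return re.compile("".join(pieces))


pattern = spintax_to_regex("{I|We} like {HTML|CSS|Javascript}")
print(bool(pattern.fullmatch("We like CSS")))    # True
print(bool(pattern.fullmatch("They like CSS")))  # False
```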

~~~
kmfrk
It's an expensive operation, though. I doubt it scales, but I'm sure it's
great for smaller services. WordPress and Disqus are probably going to use
this somehow.

~~~
susi22
Disagreed. This is a super cheap operation to test.

------
dpapathanasiou
Nice, I should update my old blog post about this:
[http://denis.papathanasiou.org/2010/08/24/spam-apalooza-a-su...](http://denis.papathanasiou.org/2010/08/24/spam-apalooza-a-survey-of-modern-blog-comment-spam/)

------
lenazegher
This kind of thing really depresses me. Akismet tells me my blog [1] has had
1,364 spam comments. It's had 22 real comments.

[1] <http://supplementsos.com/blog/>

~~~
bapbap
My site gets about 15,000 spam submissions a month. You are able to make a
submission without being logged in but you have to sign up as soon as you do
for it to be published. Spammers don't do this, so the spam doesn't get
published, yet they continue on regardless.

I also get spam signups who just sign up for an account; some of the more
sophisticated accounts will verify their email address but then they don't do
anything. Very few ever come back again (if they do within a certain period of
time their IP block gets extended).

For some reason, no spammer has ever put these two scenarios together to
successfully spam my site. I wonder how they found it and why they continue to
try. Surely if you notice your spamming isn't working, you take the site off
your list, or if you are really determined, try a bit harder.

Edit: What has annoyed me is, as soon as I became aggressive at tackling the
spam (even though none ever got published) by 403'ing their IPs, they started
spamming other sites with links to mine. Even with my explanations to Google
about how it is not my doing, my site has been penalised, so I guess they win
that way.

~~~
matznerd
Remove any identifiable "footprints" from software you run on your server,
like "Powered by WordPress". That should help you out...

------
Narretz
I especially hate the kind of spam you find in the comments over at
rockpapershotgun.com, which looks similar, but is more annoying as it
interrupts the comment tree:

Joshua. I just agree… Bonnie`s postlng is good, on sunday I bought a gorgeous
Acura after bringing in $7140 thiss month and-more than, $10,000 last-munth.
this is certainly the coolest job I’ve ever done. I began this seven
months/ago and straight away was bringin in over $81 per-hour. I follow
instructions here <url>

It was very pervasive for some months, but it looks like the guys finally
found a way to block most of it. Shouldn't be too hard with all the numbers
and dollar signs.

------
Houshalter
What is the point of this? I thought the point of spamming was to insert links
to your website or promote some product. This just looks like random
compliments.

~~~
OGC
From <http://codex.wordpress.org/Comments_in_WordPress>

Depending on your site's settings, comments display slightly differently from
site to site. The basic comment form includes:

Name

Email

 _Website_

Comment

~~~
YokoZar
I have the website field turned off -- it's not visible if you are a real user
using the form, and "website" doesn't show up next to comments.

But I still get a lot of spam containing the website field; it seems like the
bot is just automatically submitting POST requests with website included. This
should be trivial to detect -- is there a plugin that just immediately deletes
all of these out of hand before passing them on to an antispam tool?

~~~
csmattryder
There's a way of setting the 'hidden' property on an input field and checking
if the field is blank.

If it's a bot sending POST requests, there's a high chance a "Website:" field
would be filled out. Innocent users will not be able to fill out the field, in
the case of a legitimate comment.
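
The server-side half of that check might look like the Python sketch below. It is framework-agnostic and makes assumptions: the submitted form arrives as a plain mapping, and the hidden field is named `website`. Both names are hypothetical.

```python
from typing import Mapping


def is_probable_bot(form: Mapping[str, str], hidden_field: str = "website") -> bool:
    """Flag a submission whose hidden field came back non-empty.

    Real visitors never see the field, so they leave it blank; a bot that
    fills in every input it finds will not.
    """
    return bool(form.get(hidden_field, "").strip())


print(is_probable_bot({"comment": "Nice post", "website": "http://spam.example"}))  # True
print(is_probable_bot({"comment": "Nice post", "website": ""}))                     # False
```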

~~~
jroakes
This is called a honeypot field. Basically spam bots are stupid and try to
fill in every field (especially ones with phone, email, website, etc in the
name). The honeypot is a hidden field that if filled out causes the form to
not submit completely. There are some plugins for WordPress that handle this
for contact forms (Contact Form 7) and for comment forms. If you search for
"honeypot" in the WordPress plugin directory, you will find them.

Also, many spammers hit the wp-comments-post.php file directly. Since they hit
it directly there is no referrer passed like there would be if a POST was sent
to it from a page on your site. This page: <http://goo.gl/n5VHm> on
WordPress.com has information on code that can be added to your .htaccess file
that will crush bots that POST to wp-comments-post.php with no referrer
present.
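
The same referrer test can be expressed in application code instead of .htaccess. The Python below is only a sketch under stated assumptions: request headers arrive as a plain mapping keyed by the exact HTTP spelling `Referer`, and `site_host` is your own hostname. The header is trivially forged and some legitimate browsers omit it, so treat a failure as a signal rather than proof.

```python
from typing import Mapping
from urllib.parse import urlsplit


def has_same_site_referrer(headers: Mapping[str, str], site_host: str) -> bool:
    """Return True only if the Referer header points at a page on site_host."""
    referrer = headers.get("Referer", "")
    # urlsplit("") yields hostname None, so a missing header never matches.
    return urlsplit(referrer).hostname == site_host


print(has_same_site_referrer({"Referer": "https://blog.example/post-1"}, "blog.example"))  # True
print(has_same_site_referrer({}, "blog.example"))                                          # False
```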

------
hmottestad
I got this one myself the other day.

Have a {great|good|excellent|fantastic} day!I'm very {happy|glad|pleased} when
see your post.I quite {agree with|endorse|approve of} your {point of
view|viewpoint|standpoint|views on politics|opinion on public affairs}.I will
continue to {focus|atte...

------
mardix
Oh Well! I just wrote CanOfSpam <https://gist.github.com/mardix/5438589>, a
PHP implementation to get random text. It randomly picks a comment, and
randomly picks optional text in the { } tags.

------
tannerc
Add in dynamic content (like mentioning a blurb from a past blog post, or a
tweet from the writer, or - even better - a tweet @ the author from someone
else prominent) and you've got yourself one slick marketer... erm, I mean
spammer.

------
b0rsuk
Don't captchas solve this? Are spam tools able to bypass them too often, or
is it because people find captchas annoying?

~~~
soult
Even a captcha that can separate humans from machines with 100% accuracy does
not defend against cheap labour ($1-5 / 1000 captchas) in developing countries
like India.

------
jrochkind1
Have others been seeing a huge increase in spam getting past the Akismet filter
on WordPress lately, or is it just me?

------
matznerd
As much as you guys are going to hate on my comment, well-written spintax is
an art form.

------
becasual
About this: they use spamming software for internet marketing, which works, though!

