Hacker News
Comment spam random text template (github.com)
150 points by Titanous 1696 days ago | 72 comments

It's a sad moment when you first have to explain to your mother/client/etc. that no, Discount Oakley Sunglasses didn't really enjoy your superb post.

It's a shockingly effective spam tactic. People cannot resist praise.

I must say that you've done an excellent job with commenting on Hacker News. These are genuinely fascinating ideas and I have spent 3 hours today reading your comments. If more people left comments like you do then the Internet would be a better place.

I really enjoy listening to your podcasts. I always learn something from them. In the last episode you featured a new project that is very close to one I'm working on right now. It'd be great if you checked it out, because it has features the other project doesn't. Features that I think make it better and easier to use. But enough about that. Any news on the new podcasts?

That's how you do it. (:

Except you're clearly not spamming but actually referring to my work.. or are you! Oh, you win..

That's why I get paid the big bucks. =P

In all seriousness, I do listen to your podcasts. Proof: You forgot to renew the domains once, and joked about it.

Thanks on your marvelous posting! I actually enjoyed reading it, you may be a great author. I will always bookmark your blog and definitely will come back later on. I want to encourage you to definitely continue your great job, have a nice evening!

A work of a harvest,accurately because of your hard writing, we can feel so much eudaemonia, learn more our own understanding of their. The world could be so great....

Apparently some of these bots can't even properly place spaces after commas.

I didn't catch on to what you were doing at first. However, if you look at qeorge's profile, he actually has an 8.58 "avg" stat, which is incredibly high and means that you must be right.

Come on. Are you calling Discount Oakley Sunglasses a liar?

Well, if you think that selling quality sunglasses at discounted prices is lying, then yes, they are liars. More so if you include that each and every one of the sunglasses they sell comes with a 90-day money-back guarantee. On top of that, they include a scratch-resistant coating that doubles as a shield against those harmful UV rays. But yes, liars. Indeed.

I felt out of the loop not knowing what Discount Oakley Sunglasses was...until I got a spam comment from them just now.

"This article made me become shiny. After reading this article, I learned a lot. I will follow your blog. I wish everyone like me herereap happy, bring in moved...."

I made them become shiny...wow surely this must be authentic?

Why does it matter if Discount Oakley Sunglasses really enjoyed your post or not? As long as the writer feels more motivated to continue writing afterwards, what do you possibly achieve by telling him or her, "Oh, don't believe that compliment, it's CG'd"?

Yes, I agree with you that it is a sad moment when you have to break through people's ignorance and make them see the ugly side of the world, but it's sad because you played the demon and offered them knowledge, not because Oakley Sunglasses gave someone a CG'd compliment.

Well, it means the spam gets left online, which is why they're spooging all over everyone's comment systems in the first place.

Because the comment, when approved, can affect their website negatively.

If you are the trusted authority/advisor (for a mother/client/etc) then you should let them know.

If there is a worse thing than a blog with no comments, it's a blog with only spam comments.

They're the cobwebs of the internet.

I have a friend who works for a large marketing company where they write website content for companies that looks almost identical to this. The company hires hundreds of young writers at a time and pays them minimum wage to come up with a few of these every day. They pass their paragraphs around to other writers who reword it again so it will show up as a unique site for search engines. I call it the spam factory.

There even exist automated tools for these tasks, such as 'The Best Spinner' (google it for more info), which takes in a chunk of text and spits out spun content. It mostly uses a database of synonyms to achieve this. Oh, and if you care about quality, there is a manual mode where you put in your text and manually choose the suggested synonyms for words/phrases. The developer was clever enough to store which synonyms the users were manually choosing, and with that data, most of the automatically generated output now seems legible. There are also tools to scrape and spam these spun comments onto various blogging and forum platforms (google for 'scrapebox' and 'xrumer').

I'd also like to point out:

At this time there also exists computerized methods intended for these types of jobs like 'The Best Spinner' (google this intended for a lot more info) that may take in a amount connected with wording in addition to spits out and about spun articles. The idea largely uses a data source connected with word alternatives to achieve this. Also and in addition, in the event you love quality, there is a guide book setting in which anyone place in ones wording in addition to personally find the suggested word alternatives intended for words/phrases. Your designer had been intelligent enough to store exactly what word alternatives your end users have been personally picking and from now on to be able info, most of the routinely made productivity would seem legible. Also, you can find methods to clean in addition to spam these types of spun reviews onto numerous writing a blog in addition to community forum websites (google intended for 'scrapebox' in addition to 'xrumer').

I had no idea these things (from here: http://bestfreespinner.com) even existed until your comment. Now I just can't stop pushing song lyrics through.

What's the company?

With the right answer to this question, the original comment would need to be revised to "I have a friend who worked for a now-bankrupt marketing company until the Google anti-spam team found out about said company."

Nice try Matt "head of the webspam team at Google" Cutts ;-)

It could be this company:


For example:



Based in Las Vegas, Nevada we are a company looking for qualified individuals to help out with writing tasks such as:

- short stories
- movie scripts
- sales scripts
- articles
- news feeds
- PR websites
- blog posts



By commenting here you just proved that Google has trouble identifying these kinds of sites algorithmically.

Or he's just curious who his adversary is. I would be.

Or he is doing his job!

Alternatively, that I have a sense of humor. :)

I love a good con, and it's instructive to teach people how they are tricked so that it doesn't happen to them.

One of my favorite spams happened a couple of weeks ago. The spammer said he had written a very long post about my excellent article, but his computer crashed, so he wasn't going to repeat it.

This is very similar to "Gee, if you could only have read my comment, you would have loved how on-topic and awesome it was!"

We've created this system where we are paying people to drop by our blogs and tell us how awesome we are. Strange world.

Isn't a trivial templating system like this a very easy thing to train a Bayesian spam filter on, even without the source template? I guess blog spammers mostly prey on entirely unmaintained sites for something like this to work.

But I'm going to have a hard time avoiding using "fastidious!" as a general expression of approval now.

I don't know how easy it is to train a spam filter on, but nested spintax quickly scales to millions of unique output texts. It can be spun at the paragraph, sentence and word level. It's really nasty stuff.

Sure, there can be millions of unique texts, but they all follow certain very similar patterns. That's what would make even a simple naive Bayes filter very effective against them.

I think you're misunderstanding how naïve Bayesian spam filters work. These techniques explicitly do not recognize patterns at all, they just look at vocabulary, and treat each word as a completely independent feature—so they don't even recognize "discount viagra" as a pattern, but merely two independent features "discount" and "viagra".

First of all, you can have any sort of features you want. So you could certainly look at things like sentence structure, pairs of words and so on.

More importantly, I used "pattern" in a very general sense: there is strictly more structure to a post generated with this system than a random post or even a human-authored one. So repeated words between posts are a pattern, and a very important one at that.

You can build a naive Bayes classifier with n-gram features.

Naïve Bayesian filtering might not work very well on this kind of text. It basically looks like a regular comment, until you start recognizing that it always follows the same pattern. Your basic Bayesian classifier will throw all of the words into an unordered bag before analyzing them, which loses all of the information about patterns and word order. The resulting words are considered "independent", which means that even though the template might generate the words "pretty worth bloggers content online" every time it uses the first template, the naïve Bayesian classifier will never figure that part out.

My suspicion is that existing Bayesian classifiers have pushed the spammers to develop more natural-seeming templates, like this one.

You can run a Bayesian filter over pairs (or even triplets) of words (although this could cause the probabilities to be a bit off, because pairs like "although this" and "this could" are not truly independent). The downside is that as you do this, you drastically increase the size of the model and the amount of training data required.
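For what it's worth, here is a minimal Python sketch of the word-pair idea. It is an illustration under assumptions, not anyone's production filter: the names `NaiveBayes` and `ngrams`, the whitespace tokenizer, and the add-one smoothing are all choices made for this example, and it assumes each label has been trained at least once.

```python
import math
from collections import Counter


def ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max for a comment."""
    words = text.lower().split()
    feats = []
    for n in range(1, n_max + 1):
        for i in range(len(words) - n + 1):
            feats.append(" ".join(words[i:i + n]))
    return feats


class NaiveBayes:
    """Toy two-class naive Bayes over word n-grams (hypothetical helper)."""

    def __init__(self, n_max=2):
        self.n_max = n_max
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = Counter()

    def train(self, text, label):
        self.docs[label] += 1
        self.counts[label].update(ngrams(text, self.n_max))

    def score(self, text, label):
        # log P(label) + sum of log P(feature | label), add-one smoothed.
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"]))
        total = sum(self.counts[label].values())
        logp = math.log(self.docs[label] / sum(self.docs.values()))
        for feat in ngrams(text, self.n_max):
            logp += math.log((self.counts[label][feat] + 1) / (total + vocab))
        return logp

    def classify(self, text):
        return max(("spam", "ham"), key=lambda label: self.score(text, label))
```

With `n_max=2` the model stores every adjacent word pair as well as every word, which is where the extra model size and training data mentioned above come from.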

Bayesian filters can also take more than just the words in the text into account - for example, they can take the submitting IP address (or perhaps /24 or ASN) into account, or a spam classification from external sources.

There are certainly better methods that could be built for recognising unknown templates - a simple known-state Markov model would be sufficient for the cases where templates substitute one word at a time, and you could conceivably use an unsupervised learning algorithm to discover an unknown number of models from a large corpus of comments.

Something like a Hidden Markov Model might be better then? That way you can keep the information about word order.

It looks like this factors out to 4,351,250,624 unique comments.

How did you calculate this?

The set of strings is isomorphic to a Cartesian product of sets with the same cardinalities as the sets of options. For example, if the template was "{I,We} like {HTML, CSS, Javascript}", you can make the set {0,1}x{0,1,2}, where each element of the product set maps to exactly one string and vice versa: (0,0) might represent "I like HTML", and (1,2) might represent "We like Javascript".

Because there is a one-to-one map (bijection) between the Cartesian product set and the set of strings, the size of the product set is the same as the size of the set of strings.

The cardinality of a Cartesian product AxB, where A and B are sets, is |AxB| = |A|x|B|, so to find the size of the set of strings, you just need to multiply together the number of options at each point in the template where you have a choice.
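A quick program can do the multiplication. The sketch below is one possible way, assuming a well-formed template written in the `{a|b|c}` pipe syntax used by spintax; `count_spins` is a name made up for this example. Choices in sequence multiply, while the alternatives inside one group add, which is what lets it handle nested groups. It counts choice paths, so it equals the number of distinct strings only when no two paths produce the same text.

```python
def count_spins(template):
    """Count the choice paths in a (possibly nested) spintax template."""
    pos = 0

    def sequence():
        # Literal text mixed with {..|..} groups: group counts multiply.
        nonlocal pos
        total = 1
        while pos < len(template) and template[pos] not in "|}":
            if template[pos] == "{":
                total *= group()
            else:
                pos += 1
        return total

    def group():
        # Alternatives separated by "|": their counts add.
        nonlocal pos
        pos += 1  # skip "{"
        total = sequence()
        while pos < len(template) and template[pos] == "|":
            pos += 1  # skip "|"
            total += sequence()
        pos += 1  # skip "}"
        return total

    return sequence()
```

For the two-slot example above, `count_spins("{I|We} like {HTML|CSS|Javascript}")` gives 2 x 3 = 6.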

Multiply the count in each set, gotcha. But did he just count it for each set? It seems like that would be time-consuming. Is there some shortcut? Or did he just write a quick program to do it?

Ahh, you beat me to it.

It's called "Spintax". People in the internet marketing business use this all the time. Not just for comment spam, but to generate different versions of entire articles for submission to article directories and what not.

This is one of the more well known "spinners" as they are called: http://thebestspinner.com/

1 Million results on Google for "I have been surfing online more than 2 hours today, yet I never found any interesting article like yours" [1]

[1]: https://www.google.com/search?q=I+have+been+surfing+online+m...

Since that's only one of 16 combinations, I checked another. 7.7mil results if you change the 2 to a 3. Well then.

Here's a quick script that spins the text.

    gem install spintax_parser

    ruby -rspintax_parser -ropen-uri -e 'String.send(:include, SpintaxParser); puts URI.open("https://bit.ly/Ziv9Aw").read.gsub("\n", " ").unspin'

Whoa, `spintax_parser`? ...are there legitimate uses for this?

Perhaps generating content to use with a test suite?

Man... I'd actually get to an ATM, but I took during supervised visitation to see if they actually had to talk to any judgment. I don't know what is written so we could draw the appropriate details. I had the Sponge Bob Lego set. Not only did they change the social worker again - if lowering the lien can be cancelled or reduced, Windows Vista would not let that get in contact with me. Will we ever know the rest? I told her my name when I signed in today... and then cashed it and put six graveyard cards down on them and that trying to get used to have the present ability to make a deposit. This is the reason it was too much time as an and the house was not there. Maybe whoever poured stuff on my answering machine fell... and now they can't even eat a slice without it tearing and falling to pieces. She is now one of my underwear that is being hacked into, they probably figured out how much it would cost for the attic that flew off the bus. It did install when I told her that she was first soldered, but me... how important a father to a building! I walked there from 2003! Yesterday, I called my lawyer and said that I will clean the apartment tomorrow.

Markov chain?

I'm going to make one when I get home today.

Oh god that is priceless. In the right hands you could create a regex that would block all spam from this guy!

It's an expensive operation, though. I doubt it scales, but I'm sure it's great for smaller services. WordPress and Disqus are probably going to use this somehow.

Disagreed. This is a super cheap operation to test
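To make the regex idea concrete, here is a rough Python sketch. It assumes you already have the spammer's template, that it is well-formed, and that `|` only appears inside braces; `spintax_to_regex` is a hypothetical helper name. Spintax maps almost one-to-one onto regex syntax: `{` becomes a non-capturing group, `|` stays an alternation, and everything else is escaped as literal text.

```python
import re


def spintax_to_regex(template):
    """Compile a spintax template into a regex matching every possible spin."""
    parts = []
    for ch in template:
        if ch == "{":
            parts.append("(?:")  # open a non-capturing group
        elif ch == "}":
            parts.append(")")
        elif ch == "|":
            parts.append("|")
        else:
            parts.append(re.escape(ch))  # literal text
    return re.compile("".join(parts))


pattern = spintax_to_regex("Have a {great|good} day! I'm very {happy|glad}.")
is_spam = pattern.fullmatch("Have a good day! I'm very glad.") is not None
```

Testing a comment is then a single `fullmatch` call per known template. It only catches exact spins, so a spammer who edits the template slightly would slip through.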

Nice, I should update my old blog post about this: http://denis.papathanasiou.org/2010/08/24/spam-apalooza-a-su...

This kind of thing really depresses me. Akismet tells me my blog [1] has had 1,364 spam comments. It's had 22 real comments.

[1] http://supplementsos.com/blog/

My site gets about 15,000 spam submissions a month. You are able to make a submission without being logged in, but you have to sign up as soon as you do for it to be published. Spammers don't do this, so the spam doesn't get published, yet they continue on regardless.

I also get spam signups who just sign up for an account. Some of the more sophisticated ones will verify their email address, but then they don't do anything. Very few ever come back again (if they do within a certain period of time, their IP block gets extended).

For some reason, no spammer has ever put these two scenarios together to successfully spam my site. I wonder how they found it and why they continue to try. Surely if you notice your spamming isn't working, you take the site off your list or, if you are really determined, try a bit harder.

Edit: What has annoyed me is that as soon as I became aggressive at tackling the spam (even though none ever got published) by 403'ing their IPs, they started spamming other sites with links to mine. Even with my explanations to Google about how it is not my doing, my site has been penalised, so I guess they win that way.

Remove any identifiable "footprints" from software you run on your server, like "Powered by WordPress"; that should help you out...

Woot, you have 22 comments!

Why bother about the spam?

I especially hate the kind of spam you find in the comments over at rockpapershotgun.com, which looks similar, but is more annoying as it interrupts the comment tree:

Joshua. I just agree… Bonnie`s postlng is good, on sunday I bought a gorgeous Acura after bringing in $7140 thiss month and-more than, $10,000 last-munth. this is certainly the coolest job I’ve ever done. I began this seven months/ago and straight away was bringin in over $81 per-hour. I follow instructions here <url>

It was very pervasive for some months, but it looks like the guys finally found a way to block most of it. Shouldn't be too hard with all the numbers and dollar signs.

What is the point of this? I thought the point of spamming was to insert links to your website or promote some product. This just looks like random compliments.

From http://codex.wordpress.org/Comments_in_WordPress

Depending on your site's settings, comments display slightly differently from site to site. The basic comment form includes:





I have the website field turned off -- it's not visible if you are a real user using the form, and "website" doesn't show up next to comments.

But I still get a lot of spam containing the website field; it seems like the bot is just automatically submitting POST requests with website included. This should be trivial to detect -- is there a plugin that just immediately deletes all of these out of hand before passing them on to an antispam tool?

There's a way of setting the 'hidden' property on an input field and checking if the field is blank.

If it's a bot sending POST requests, there's a high chance the "Website:" field will be filled out. Legitimate users never see the field, so a real comment will leave it blank.

This is called a honeypot field. Basically, spam bots are stupid and try to fill in every field (especially ones with phone, email, website, etc. in the name). The honeypot is a hidden field that, if filled out, causes the form to not submit completely. There are some plugins for WordPress that handle this for contact forms (Contact Form 7) and for comment forms. If you search for "honeypot" in the WordPress plugin directory, you will find them.
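Server-side, the honeypot check itself is tiny. A minimal Python sketch, assuming the submitted form arrives as a plain dict of strings and that the hidden field is named `website` (both are assumptions for illustration, not how any particular plugin does it):

```python
HONEYPOT_FIELD = "website"  # hypothetical name of the hidden form field


def is_probably_bot(form):
    """Return True when the hidden honeypot field was filled in."""
    return bool(form.get(HONEYPOT_FIELD, "").strip())
```

A submission that trips the check can be dropped before it ever reaches the anti-spam service. Bots that render the page or skip unknown fields will not be caught this way.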

Also, many spammers hit the wp-comments-post.php file directly. Since they hit it directly there is no referrer passed like there would be if a POST was sent to it from a page on your site. This page: http://goo.gl/n5VHm on WordPress.com has information on code that can be added to your .htaccess file that will crush bots that POST to wp-comments-post.php with no referrer present.
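The referrer test can also be expressed in application code. This is only a sketch in Python, not the .htaccess rules from the linked page: `has_local_referer` is a made-up name, and it assumes the request headers are available as a dict keyed by `Referer`. The header is trivially spoofable and some privacy tools strip it, so treat a failure as a signal, not proof.

```python
from urllib.parse import urlparse


def has_local_referer(headers, site_host):
    """Return True when the Referer header points at a page on site_host."""
    referer = headers.get("Referer", "")
    # urlparse("").hostname is None, so a missing header never matches.
    return urlparse(referer).hostname == site_host
```

A POST to the comment endpoint for which this returns False could be rejected or sent to moderation.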

I got this one myself the other day.

Have a {great|good|excellent|fantastic} day!I'm very {happy|glad|pleased} when see your post.I quite {agree with|endorse|approve of} your {point of view|viewpoint|standpoint|views on politics|opinion on public affairs}.I will continue to {focus|atte...

Oh Well! I just wrote CanOfSpam https://gist.github.com/mardix/5438589, a PHP implementation to get random text. It randomly picks a comment, and randomly picks optional text in the { } tags.
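For anyone curious how little code the spinning step takes, here is a rough Python equivalent. It is not a port of the gist: `spin` and `INNERMOST` are names invented for this sketch, which assumes well-formed `{a|b|c}` groups. It repeatedly resolves the innermost groups until no braces remain, so nesting works.

```python
import random
import re

# Matches an innermost {a|b|c} group, i.e. one with no braces inside it.
INNERMOST = re.compile(r"\{([^{}]*)\}")


def spin(template, rng=random):
    """Resolve every {a|b|c} group by picking one option at random."""
    while True:
        template, replaced = INNERMOST.subn(
            lambda m: rng.choice(m.group(1).split("|")), template)
        if replaced == 0:
            return template
```

Passing a seeded `random.Random` as `rng` makes the output repeatable; picking among several templates first is one extra `rng.choice(templates)` call.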

Add in dynamic content (like mentioning a blurb from a past blog post, or a tweet from the writer, or - even better - a tweet @ the author from someone else prominent) and you've got yourself one slick marketer... erm, I mean spammer.

Don't captchas solve this? Are spam tools able to bypass them too often, or is it because people find captchas annoying?

Even a captcha that can separate humans from machines with 100% accuracy does not defend against cheap labour ($1-5 / 1000 captchas) in developing countries like India.

Have others been seeing a huge increase in spam getting past akismet filter on wordpress lately, or is it just me?

As much as you guys are going to hate on my comment, well-written spintax is an art form.

about this they use spamming software for internet marketing which works thou!
