

Ask HN: What should I do with 4 GBs of form spam? - arnorhs

I've been collecting form spam for around two years on a website called http://spambotlove.com/ and have gathered around 6 GB, or 3.7 million submissions.

At first I wanted to use the data to generate images or display statistics, but I don't really have the time or the will because I'm very busy with other things.

What should I do with it? I guess I could just delete it?

Would it be of any value to anyone if I made a simple API to query the DB?

Any ideas?
======
pedrokost
Instead of deleting it, you may want to share it in a torrent file. Maybe
someone could make some better use of it.

~~~
nyellin
I whole-heartedly agree. You don't need to do anything yourself, except for
releasing the data. Please don't, under any circumstances, delete the dataset.

Edit: Regarding your original question, I think that a torrent would be much
more useful than an API.

------
kephra
Install Weka (requires Java), do a string-to-word-vector conversion of your
data, and compare the spam and ham from your blog. Play around with machine
learning. /join #machinelearning @ freenode <- some of us are using Weka

~~~
khandelwal
If you don't like Java, you can use Python with the Natural Language Toolkit
(NLTK) and do the same thing. Please _do_ provide the dataset for download.
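
For anyone curious what a string-to-word-vector conversion looks like, here is a minimal sketch using only the Python standard library. It is not Weka's filter or NLTK's API, just a plain bag-of-words count; the sample messages and the names `tokenize` and `to_word_vector` are made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def to_word_vector(text, vocabulary):
    """Return the count of each vocabulary word in `text`, in vocabulary order."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


# Hypothetical sample messages; real input would come from the spam dataset
# and from legitimate (ham) comments.
spam = ["Buy cheap pills online, cheap cheap!"]
ham = ["Thanks for the post, I learned a lot."]

# The vocabulary is every distinct token seen in either class.
vocabulary = sorted({word for msg in spam + ham for word in tokenize(msg)})
vectors = [to_word_vector(msg, vocabulary) for msg in spam + ham]
```

Each message becomes a list of counts over a shared vocabulary, which is the kind of input a classifier can then be trained on.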

------
nyellin
You should add the "rel=nofollow" attribute to the links that you show on your
website. You can damage your search engine rankings with spammy outbound
links.
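
As a rough sketch of what that could look like when rendering submitted links, assuming the site builds its HTML in Python (the thread does not say what it actually runs on) and using a hypothetical `render_link` helper:

```python
from html import escape


def render_link(url, text):
    """Render an anchor tag that asks search engines not to follow the link."""
    # Escape both values so spam content cannot break out of the markup.
    return '<a href="{}" rel="nofollow">{}</a>'.format(
        escape(url, quote=True), escape(text)
    )
```

Escaping matters at least as much as the attribute here, since the link text comes straight from spam bots.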

------
arnorhs
I had actually asked about this very same issue a year ago, but the database
was much smaller and I didn't really receive any breathtaking ideas.

Also, WTF of the day: the last submission was _exactly_ a year ago!
<http://news.ycombinator.com/item?id=1147085>

~~~
JacobAldridge
Close: 356 days, not 365, though still close enough to be a noteworthy
coincidence!

------
windsurfer
It would be neat to generate an image tracking similarities between spam
messages. Sounds like a nifty project in R.
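
One simple way to get the numbers behind such an image is a pairwise similarity matrix. The sketch below uses Jaccard similarity over word sets and is written in Python rather than R; the measure and the function names are assumptions for illustration, not something proposed in the thread.

```python
import re


def word_set(text):
    """Return the set of distinct lowercase word tokens in `text`."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity: shared words divided by all words in either message."""
    sa, sb = word_set(a), word_set(b)
    if not sa and not sb:
        return 1.0  # two empty messages are treated as identical
    return len(sa & sb) / len(sa | sb)


def similarity_matrix(messages):
    """Return a square matrix of pairwise similarities, suitable for a heatmap."""
    return [[jaccard(a, b) for b in messages] for a in messages]
```

With 3.7 million submissions a full matrix is far too large, so in practice this would run on a sample.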

~~~
arnorhs
That's true. You could even visualize how spam has changed over the past two
years: choice of words, length, number of links, etc.
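
A minimal sketch of the per-year numbers such a chart could be built from, assuming each submission is available as a `(year, text)` pair (the actual database schema is not described here) and counting a link as any `http://` or `https://` occurrence:

```python
import re
from collections import defaultdict

LINK_RE = re.compile(r"https?://", re.IGNORECASE)


def yearly_stats(submissions):
    """Aggregate (year, text) pairs into per-year averages."""
    totals = defaultdict(lambda: {"count": 0, "chars": 0, "links": 0})
    for year, text in submissions:
        t = totals[year]
        t["count"] += 1
        t["chars"] += len(text)
        t["links"] += len(LINK_RE.findall(text))
    return {
        year: {
            "count": t["count"],
            "avg_length": t["chars"] / t["count"],
            "avg_links": t["links"] / t["count"],
        }
        for year, t in sorted(totals.items())
    }
```

Word choice could be tracked the same way by keeping a word counter per year.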

------
sagacity
Please consider doing at least the following:

Generate a separate file (csv/sql or something) with the structure:

IP Address, number of submissions

and make it available for download.

I guess people could use it to flag probable spam.
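
Something along these lines would produce that file. This is only a sketch: it assumes the IP addresses can be pulled out of the database as a plain sequence of strings, and the column names are made up.

```python
import csv
import io
from collections import Counter


def ip_counts_csv(ip_addresses):
    """Return CSV text with one row per IP address and its submission count."""
    counts = Counter(ip_addresses)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["ip_address", "submissions"])
    # Most active addresses first; ties are ordered by IP for stable output.
    for ip, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        writer.writerow([ip, n])
    return out.getvalue()
```

For the full dataset, the same grouping could be done directly in the database with a `GROUP BY` on the IP column.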

