

I have 6,000,000 RSS articles in a DB. What can I do? - robertpohl

I created a website a couple of years ago (thatstoday.com) where members can
add RSS feeds and read them in our "news reader". The years went by and we've
indexed all feed articles in our DB (and in Lucene).

Now I'm not sure what to do with all the data, since the DB is getting large
and difficult to handle.

Should I delete it or create a new service? What do you guys suggest?

Thanks,
Rob
======
gbog
> the db is getting large and difficult to handle.

Did you store the raw content in the database? If so, you might consider
writing files instead. These blobs, like pictures, are better stored as files.
In your database you should keep the relational data: URL, time, adder, etc.,
properly indexed (probably by adder, time, and maybe keywords). Then a 6M-row
table is quite a small thing for any RDBMS (as long as your SELECTs filter on
indexed columns).
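
A minimal sketch of that split, in Python with SQLite (the schema and column
names are illustrative guesses, not your actual layout):

    import os
    import sqlite3

    os.makedirs("bodies", exist_ok=True)
    conn = sqlite3.connect("articles.db")
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS articles (
            id        INTEGER PRIMARY KEY,
            url       TEXT NOT NULL,
            adder     TEXT NOT NULL,  -- who added the feed
            added_at  TEXT NOT NULL,  -- ISO timestamp
            body_path TEXT NOT NULL   -- raw content lives on disk
        );
        CREATE INDEX IF NOT EXISTS idx_adder_time
            ON articles (adder, added_at);
    """)

    def store(article_id, url, adder, added_at, raw_html):
        path = "bodies/%d.html" % article_id
        with open(path, "w", encoding="utf-8") as f:
            f.write(raw_html)                    # blob goes to a file
        conn.execute("INSERT INTO articles VALUES (?, ?, ?, ?, ?)",
                     (article_id, url, adder, added_at, path))
        conn.commit()                            # only metadata in the DB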

~~~
robertpohl
Good idea! When I'm importing the raw data I add it to Lucene for similar-
articles matching, but after that it could be dumped to text.

------
bulte-rs
You've got the basis for a "you-like-this-so-perhaps-you-will-want-to-read-
this" recommendation engine. Perform some n-gram analysis on the corpus (as
mentioned in another reply); do some basic cosine-similarity analysis against
the feeds people subscribe to and see what pops up. Try other techniques, e.g.
from ICWSM[1] (the last time I did something like this was April 2007);
iterate; analyse results; publish.

At least you'll have fun (YMMV)...

[1] <http://www.icwsm.org/[2007-2011]>
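
A back-of-the-envelope sketch of the n-gram/cosine part in plain Python
(tokenisation and n are arbitrary choices here):

    from collections import Counter
    from math import sqrt

    def ngrams(text, n=2):
        words = text.lower().split()
        return Counter(tuple(words[i:i + n])
                       for i in range(len(words) - n + 1))

    def cosine(a, b):
        dot = sum(a[g] * b[g] for g in set(a) & set(b))
        norm = (sqrt(sum(v * v for v in a.values()))
                * sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    # Compare an article against text from feeds a user subscribes to:
    print(cosine(ngrams("apple announces new iphone today"),
                 ngrams("rumor: apple announces new iphone next week")))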

~~~
robertpohl
I already use "find similar articles" techniques with Lucene. But I'm
thinking it might be more interesting on another, more data-centered website?

~~~
dodo53
Does your reader record which items in a feed were read? I'm sure lots of feed
providers would be interested in metrics on which of their articles were
frequently read (and possibly some clustering of interests based on what else
readers subscribe to).

~~~
robertpohl
Yes it does! The problem is the low activity on the site. If the traffic grew,
the metadata would explode ;)

------
xSwag
You should dump a copy of the database somewhere so we can all take a look at
it and perhaps analyse it.

~~~
robertpohl
I'd love to do that, but how can I finance storage and bandwidth? PayPal
donations?

Hmm... I know a few peeps at MS/Azure... maybe they can sponsor. I'll get back
on this one.

~~~
kip_
Torrent that. Exactly the kind of thing that BitTorrent is for.

~~~
vladiim
+1

------
drewcrawford
There's no e-mail address in your profile. Please contact me at the e-mail
address in mine.

~~~
robertpohl
I have an encrypted version there now ;)

------
huragok
Sell it or lease it. I'm sure there's some value in the aggregation of feeds
(though I can't imagine what besides user habits).

------
haddr
For God's sake don't delete it! First of all, dump it to files instead of the
DB (or use some NoSQL document storage, such as MongoDB; the structure of RSS
is actually non-relational, I suppose). Second of all: is your data clean? If
not, you might need to clean it of any boilerplate (such as HTML code). Then
you can process it with some tools. There are some good NLP tools available,
such as GATE; you may have a look at them. You can do a great deal of things
there:

- detect some entities (companies? products?) and do some classification of
documents

- detect some events (iPhone announcements, etc.)

- if you have time & date (hope you have) you can do some trending-topics
analysis (what was hot in June 2010)

- you probably can't sell the data as the content of the articles is not
yours, but you may sell some derived data (analysis, etc.)
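
For the cleaning step, a small stdlib-only Python sketch that strips HTML down
to text (real-world boilerplate removal needs more than this):

    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Collect text content, skipping <script> and <style> blocks."""
        def __init__(self):
            super().__init__()
            self.parts, self.skip = [], 0

        def handle_starttag(self, tag, attrs):
            if tag in ("script", "style"):
                self.skip += 1

        def handle_endtag(self, tag):
            if tag in ("script", "style") and self.skip:
                self.skip -= 1

        def handle_data(self, data):
            if not self.skip:
                self.parts.append(data)

    def strip_boilerplate(html):
        p = TextExtractor()
        p.feed(html)
        return " ".join("".join(p.parts).split())

    print(strip_boilerplate("<p>Hello <b>world</b><script>x()</script></p>"))
    # -> "Hello world"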

------
raverbashing
Natural language processing, n-gram studies. Or just some kind of web service
where people can look this up.

~~~
robertpohl
Thought of that, but I'd need to add it to 50GB+ of cloud SQL storage, which
costs a few bucks... :)

------
spobo
Write an algorithm to match articles from different sources that are about the
same story. That way you can auto-hide news that you've supposedly already
read from another source. Clustering news from different sources would be a
killer feature for me :)
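
A rough sketch of one way to do the matching, with Jaccard similarity over
word shingles (the 0.3 threshold is a made-up starting point; real clustering
needs tuning):

    def shingles(text, n=3):
        words = text.lower().split()
        return {" ".join(words[i:i + n])
                for i in range(len(words) - n + 1)}

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a or b else 0.0

    def same_story(text_a, text_b, threshold=0.3):
        # Above the threshold, treat the two articles as one story
        # and hide the duplicate in the reader.
        return jaccard(shingles(text_a), shingles(text_b)) >= threshold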

------
waxjar
Depending on how far back the data goes, you could try to spot language trends
through time or make a pretty graph of the average article length through
time. I'm not that inspiring.
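
The length graph is a one-afternoon job; here's a sketch of the aggregation
half, assuming (date, text) pairs pulled from the DB:

    from collections import defaultdict
    from datetime import date

    def average_length_by_month(articles):
        """articles: iterable of (date, text) -> {(year, month): avg words}"""
        sums = defaultdict(lambda: [0, 0])  # (year, month) -> [words, count]
        for published, text in articles:
            bucket = sums[(published.year, published.month)]
            bucket[0] += len(text.split())
            bucket[1] += 1
        return {k: s / c for k, (s, c) in sorted(sums.items())}

    sample = [(date(2010, 6, 1), "iPhone announcement day"),
              (date(2010, 6, 15), "a much longer article about the event")]
    print(average_length_by_month(sample))  # {(2010, 6): 5.0}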

I hope this'll get a follow-up; I'm sure someone can think of something
awesome to do with it.

If you decide to open-source it (copyrights?), use BitTorrent: it's the
perfect tool for this job.

------
kngspook
How big is the DB dump?

I know I would be interested in downloading it, and just poking it for
interesting stats...

~~~
robertpohl
50-60 GB

------
ramigb
I suggest you change your privacy policy (I am sure you have one) and tell the
users that you will open the database to researchers. Make the open database a
donation-based service; this way you can pay to host it and might get some
good extra money.

~~~
robertpohl
Sounds tempting!

------
bromagosa
Are they all open? If so, I'd contact Wikimedia and see what use they can give
them.

------
rocky1138
Do you have the rights to distribute the articles in your database?

------
evanwolf
Donate it to <http://Archive.org>. They can reconstruct disappeared web sites
from it, preserving the web's past.

------
tzaman
If you think it could be useful to someone, try to sell it

~~~
robertpohl
How much do you offer? ;)

~~~
tzaman
I'm not that someone :D

------
ahmedaly
Maybe I can offer you hosting for free, if this is your problem... please
email me at ahmed(at)svwebdev.com

------
gauravvijay
I can sponsor the S3 storage but with limited IO

~~~
robertpohl
I guess it would need more than just storage, but also an API so you can query
it. Maybe a jQuery/JSONP layer on top of a Lucene index?
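
Something like this Flask sketch, maybe (search() here is a hypothetical
stand-in for whatever actually queries Lucene):

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    def search(query, limit):
        # Hypothetical stand-in; a real version would hit the Lucene
        # index (e.g. through Solr or Elasticsearch).
        return [{"title": "example article",
                 "url": "http://example.com"}][:limit]

    @app.route("/search")
    def search_endpoint():
        query = request.args.get("q", "")
        limit = int(request.args.get("limit", "10"))
        return jsonify(results=search(query, limit))

    if __name__ == "__main__":
        app.run()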

------
aw4y
make it open!

------
lucamartinetti
A bunch of interesting things. It's a nice NLP corpus. Put a dump on S3 and
make it public.

~~~
openmosix
love it...

