

Watch us crawl the news in realtime - jbr
http://newsbasis.com/news

======
jbr
You can read the motivation for this technology here:
http://ceo.newsbasis.com/try-newscatcher-our-built-in-real-time-media

It's built on rails, node.js, resque, and solr.

------
mikecane
Would be nice to see a list of sources. When I first tuned in, I didn't
recognize any of them, then the MSM started to come in.

~~~
jbr
Yeah, I understand. I think at this point we're considering the list (~10k
sources) a business asset.

~~~
nolite
Is it possible to give us a rough overview of your tech infrastructure?
(Servers, processes, storage)?

~~~
jbr
Sure. This probably will turn into a blog entry, but here's the gist:

    
    
      4 medium sized linodes, divided as follows:
      1 app:    unicorn (rails), nginx
      1 db:     redis, solr, mysql
      2 worker: resque workers (both ruby/rails and node.js) & the crawler
    

The crawler is written in node.js, backed by redis. When it finds a new page,
it downloads it to shared local storage and adds a task to a resque queue
monitored by the rails workers. They add a row to a mysql table that
represents the permanent record of the page, use nokogiri to extract the body
content and any metadata, index it into solr, delete the local copy, and
upload the page to an s3 archive. When you request the page, rails asks solr.
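
For readers who want to see the flow end to end, here is a minimal in-memory sketch of the steps described above. It is not the actual newsbasis code: the real stack is node.js, resque, mysql, nokogiri, solr, and s3, while everything below is an assumed stand-in (a `deque` for the resque queue, dicts for local storage and the s3 archive, a list for the mysql table, a word-to-ids dict for solr, and Python's stdlib `html.parser` in place of nokogiri). All function and variable names are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory stand-ins for the real services.
local_storage = {}   # shared local disk: url -> raw html
queue = deque()      # resque queue of pending urls
pages_table = []     # mysql table: one dict per page
search_index = {}    # solr: word -> set of page ids
archive = {}         # s3 archive: url -> raw html


class PageParser(HTMLParser):
    """Collects the <title> text and the text found inside <body>."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.body = []
        self._open = set()  # names of tags that are currently open

    def handle_starttag(self, tag, attrs):
        self._open.add(tag)

    def handle_endtag(self, tag):
        self._open.discard(tag)

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if "title" in self._open:
            self.title = text
        elif "body" in self._open:
            self.body.append(text)


def crawl(url, html):
    """Crawler side: save the page locally, then enqueue a task."""
    local_storage[url] = html
    queue.append(url)


def work_one():
    """Worker side: process one queued page and return its id."""
    url = queue.popleft()
    html = local_storage[url]

    # 1. Permanent record of the page.
    page_id = len(pages_table) + 1
    parser = PageParser()
    parser.feed(html)
    pages_table.append({"id": page_id, "url": url, "title": parser.title})

    # 2. Index the extracted body text.
    for word in " ".join(parser.body).lower().split():
        search_index.setdefault(word, set()).add(page_id)

    # 3. Delete the local copy, then archive the raw page.
    del local_storage[url]
    archive[url] = html
    return page_id


def search(word):
    """Request side: ask the index, then map ids back to urls."""
    ids = search_index.get(word.lower(), set())
    return [row["url"] for row in pages_table if row["id"] in ids]
```

Calling `crawl(url, html)` followed by `work_one()` moves a page from local storage into the table, the index, and the archive, after which `search(word)` returns the URLs of pages whose body contains that word. Only body text is indexed in this sketch, so a word that appears only in the title will not match.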

~~~
nolite
Nice, thanks. Any stats on how fast it is? It looks pretty fast judging by how
the web page updates.

~~~
jbr
We haven't built in much monitoring yet, but judging by the resque web
interface, most of the delay is in actually finding the new page. From there
to it showing up on your screen takes no more than a second or two. For
subscribed users, we do email notifications, and we almost always beat Google
News alerts, often by around 15 minutes.

~~~
mikecane
I have it bookmarked and find myself peeking in at least once a day during
other news downtime.

