

Downloading the internet, or how we got our first 1M articles - Swizec
http://preona.net/2011/01/downloading-the-internet-or-how-we-got-our-first-1m-articles/

======
msredmond
I think I'm missing something as far as the copyright issues of this
project go...

You say that you're downloading the entire internet for content, and that
you'll be serving it up to your readers directly through your app, for the
various reasons that you cite.

If you're linking to the content, no issues. But this doesn't sound like
linking: you're stripping the content, and now you're serving it to your
readers yourself, so isn't that taking the content from the provider and using
it for your own purposes? Are you only doing this with providers that gave you
permission to do so? And if not, is there some sort of fair use clause that
you see applies here?

Just very curious as to how you see this model working with copyrighted
content (and/or definitely let me know if I'm just missing the model
here/misreading what was written in your post...end of a long day! ;-)

~~~
Swizec
We are serving the content ourselves, but we fully attribute it to the source
and provide a link to the original website. We aren't doing anything much
worse than what, for example, the browser cache does when it loads a cached
website.

~~~
Andrew_Quentin
If that is all you are doing, then even if it were not legal I do not think
anyone would worry. But you mention that you might want to strip the
annoying ads out of the content, which changes everything completely and
probably makes it illegal.

~~~
nolite
Sorry, but people love to throw the word "legal" around loosely a lot. Let's
be specific here: what "law" would this be breaking? Who would enforce it, and
under which jurisdiction? What case has set a precedent? If this is illegal,
then what about search engine caches, or the Wayback Machine?

~~~
coderdude
> If this is illegal, then what about search engine caches, or the Wayback
Machine?

The difference between what search engines/the Wayback Machine do and what a
service like his does is that the former cache more or less complete copies
of the content without the intent to strip anything out (save for the
text-only view in Google's cache), while his service is stripping out the
monetization schemes of the content owners.

Although he links back to them (which he absolutely must do anyway), the
content owners might become angry that they aren't getting the traffic and the
chance to have their ads clicked. I think that is a reasonable viewpoint for
the content owners to have, yet I also want to believe that a service like his
should be able to exist. Perhaps some kind of middle ground could be reached
where he does show the ads, just neatly organized at the bottom or middle of
the article.

What would the content owners be able to do, though? Maybe send DMCA notices;
this would fall under copyright infringement law, I think. I'm no lawyer
though. I also think there should be some kind of allowance for this kind of
content repurposing.

~~~
msredmond
I think the big issue here is that if you own the content, and another
person is making money off that content without your consent, that's a problem
(as in potential liability). One way around that is to make absolutely no
revenue from it -- some publishers may not care. I don't know that it would
solve every problem, but I do know it's worked for others (in fact it's
rumored -- or maybe even confirmed -- that that's the reason Google doesn't
run ads on Google News, and for the most part they're not even using the full
story).

------
coderdude
I always like to read about things like this -- but if they think 1MM pages at
25gigs and 0.8 requests/sec is a lot then I can't wait for their service to
get more popular. It's similar to someone saying they have a huge database
with only a few million rows in it. They do have a good site design though. I
wish them the best of luck. It only gets more fun from there. :)

~~~
Swizec
It is a lot in the sense that we're footing the bill for handling all of this
data ourselves ...

Also a million is such a nice milestone, I just had to write a blog about it
:)

~~~
coderdude
From your article:

> Our scraper just isn't all that amazingly fast. It takes around two seconds
to extract the meat of the article out of a decently sized and reasonably
complex website.

Since you're on AppEngine does that mean that your code to extract the article
content is written in Python? If so, consider using the Python port of
Readability.js: <https://github.com/srid/readability>

I haven't tested this one personally (yet), but before it came out I ported
Readability to Python and it was very quick, less than a second to extract the
content. I used lxml to parse through the X/HTML. You'll have to do a lot of
tweaking to get the heuristics working more or less perfectly but it would be
well worth it.
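
To give a feel for the approach (this is my own minimal sketch, not the
port linked above): score the block-level elements by text density,
penalize link-heavy blocks, and keep the winner. The element names and the
scoring formula here are illustrative, not the port's actual heuristics.

    # Minimal sketch of a readability-style extractor built on lxml.
    # Core idea: score block elements by how much text they contain,
    # penalizing link-heavy blocks (menus, "related articles" lists),
    # then keep the best-scoring one.
    import lxml.html

    def extract_main_content(html):
        doc = lxml.html.fromstring(html)
        # Drop elements that never hold article text.
        for el in doc.xpath('//script | //style | //nav | //header | //footer'):
            el.drop_tree()

        best, best_score = None, 0.0
        for el in doc.xpath('//div | //article | //td'):
            text = el.text_content()
            link_text = ''.join(a.text_content() for a in el.findall('.//a'))
            # More text is good; text that lives inside links is suspect.
            score = len(text) * (1.0 - len(link_text) / (len(text) + 1.0))
            if score > best_score:
                best, best_score = el, score
        return best.text_content().strip() if best is not None else ''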

(The following assumes you're downloading and processing the content using
AppEngine.) As for bandwidth concerns, you will probably want to just get a
fat pipe into your place of business (or wherever you're operating from).
Where I'm at it's only $100/mo for a 16mbit/down line. That would handle your
growth for quite a while. Once the content is all downloaded and extracted
locally you can push it up to your AppEngine server.

~~~
Swizec
Actually, we have tried python-readability. The problem is that if you want
to run it on AppEngine you can't use lxml (no C code allowed), so you're
stuck with BeautifulSoup. It was also hellishly expensive; I think that
experiment cost us ~$10 an hour in CPU costs.

As a result, it's better to stick with the JavaScript version of Readability
and run it in node.js. It's perhaps a dash slower than a Python version would
be, but the results are much, much better and it's easier to keep up to date.
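
(Swizec doesn't spell out how the node.js piece is wired in. One simple
arrangement, purely as a sketch on my part, would be a small node process
running Readability.js behind a local HTTP endpoint, with the Python side
handing it raw HTML. The service URL below is a hypothetical stand-in.)

    # Hypothetical glue: POST raw HTML to a local node.js service that
    # runs Readability.js and returns the extracted article text.
    import urllib2

    READABILITY_SERVICE = 'http://localhost:8000/extract'  # stand-in URL

    def extract_via_node(html):
        req = urllib2.Request(READABILITY_SERVICE, html,
                              {'Content-Type': 'text/html'})
        return urllib2.urlopen(req, timeout=30).read()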

~~~
coderdude
Wow... that's pretty expensive. Better to just do all this stuff locally and
save your money for serving up requests to your users IMO. Kudos for setting
it up to work with node.js though, very neat.

------
bambax
Scraping from a central server has two problems:

1) you have to pay for it

2) you can be blocked by the content providers

Would it be possible to leverage your end users instead?

While your users read articles in your app, the app probably does little;
couldn't it use that (mostly idle) time to scrape other articles and send the
scraped/cleaned content back to your central server? Then all the server does
is store and serve.

(Of course you'd have to warn users and check the type of connection -- only
do this when the user is on wifi? -- but it would be quite legitimate for
your app to download content that has actually been requested by the user.)
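
A minimal sketch of what such a client-side helper could look like;
everything here (the queue, the upload endpoint, the on_wifi() check, the
clean_article() extractor) is a hypothetical stand-in, not anything from
the post:

    # Hypothetical client-side contributor: while idle and on wifi, work
    # through a queue of article URLs, clean each page, and report the
    # result back to the central server.
    import json
    import urllib2

    UPLOAD_URL = 'https://example.com/api/contribute'  # stand-in endpoint

    def contribute_while_idle(url_queue, on_wifi, clean_article):
        while on_wifi() and url_queue:
            article_url = url_queue.pop()
            html = urllib2.urlopen(article_url, timeout=10).read()
            payload = json.dumps({'url': article_url,
                                  'content': clean_article(html)})
            req = urllib2.Request(UPLOAD_URL, payload,
                                  {'Content-Type': 'application/json'})
            urllib2.urlopen(req, timeout=10)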

~~~
iflyplanes
I think the best way to do this would be to inform the end users of the goal
(to lower costs and provide better availability), then allow them to turn on
the feature if they wish. A checkbox in the settings/preferences pane and
maybe a popup option box on first startup could be enough to convince a
significant number of users to help out.

------
benologist
"Then everything promptly went to shit. We ran out of allocated budget on the
AppEngine, our scraping architecture melted and just about everything that can
possibly go wrong server-side did go wrong."

heh, it's always fun launching.

