

Is what I am doing wrong? - cogdissneccgrit

I feel like I am doing something bad whenever I scrape websites. Is this guilt unfounded? Is what I am doing okay?

For example, I am learning Rails right now, and decided to memorize all the code listings in the Rails tutorial (http://ruby.railstutorial.org/). If I wanted to, I could get all the code listings in that book within the next 5 minutes by using a scraper. Something that would take an hour (if I were disciplined enough to withstand the tedium of copying and pasting for an hour without breaks) would take just 5 minutes.

I love writing scrapers; it's one of the most fun things I know I can do with a computer. But after everything is said and done, with the data that I want on my hard drive, I can't help but feel guilty for doing it. I feel like I'm cheating something out of somebody. Especially in this case, because I know how helpful Mr. Hartl has been to the Ruby community with his book. Mr. Hartl has probably jumpstarted thousands of careers with that book.

So anyway, is what I am doing wrong?

-----------------
By the way, just in case someone is curious, when I say "memorize", I don't mean word-for-word memorization. And I am doing this because I usually start off by rote learning whenever I try to acquire new skills.
======
PaulHoule
No.

~~~
cogdissneccgrit
What if I were to make a website that organizes the data that I've scraped?
Would that be okay as well? Under what conditions?

~~~
smacktoward
Scraping a copy of a website _for your own personal use_ is fine. (People used
to do that all the time, back in the days of slow and unreliable modem
connections; see
[http://en.wikipedia.org/wiki/Offline_reader](http://en.wikipedia.org/wiki/Offline_reader)
.)

Scraping a copy of a website _and then republishing it_ is different. Whether
it's OK or not depends on the license under which the site is published:

\- If no license is specified you should assume that the site's creators
retain complete copyright control of it, which means you'd need to ask them
for permission before you could legally republish it. They can charge you any
fee they wish for that permission, or require you to comply with any
conditions they specify. You can negotiate with them on these points, but
you'll need to get their approval before you can act, so they have the final
say.

\- Some sites are published under permissive licenses like the GNU Free
Documentation License
([http://www.gnu.org/copyleft/fdl.html](http://www.gnu.org/copyleft/fdl.html))
or the licenses provided by Creative Commons
([http://creativecommons.org/licenses/](http://creativecommons.org/licenses/)).
Sites that use these licenses will generally carry a notice to that effect
somewhere. These licenses can allow you to republish stuff, as long as you
follow the terms of the particular license. For instance, republishing a GFDL
document requires you to license your republished version under the GFDL too,
while a document under CC's Attribution license can be republished as long as
you credit the site where the material first appeared. Check the particular
license for the site you're looking at for complete details on what the terms
are.

\- Some online documents, like e-books from Project Gutenberg
([http://www.gutenberg.org/](http://www.gutenberg.org/)), are just copies of
material that has passed into the "public domain": see
[http://en.wikipedia.org/wiki/Public_domain](http://en.wikipedia.org/wiki/Public_domain).
Public domain documents are documents on which the copyright has lapsed or
expired. These can be republished in any way you wish without having to comply
with any limitations or terms. Due to changes in copyright law, it is
extremely rare to find documents published after the late 1920s in the public
domain.

As noted above, unless the site explicitly says it's published under a
particular license or is demonstrably in the public domain, you should assume
it's not OK to republish anything from it without permission.
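
If it helps to see the rules above as a decision procedure, here's a rough Python sketch. Everything in it is made up for illustration: the `License` enum and the `republish_terms` function are hypothetical names, only the four cases discussed above are modeled, and the returned strings are loose summaries, not legal advice. The real terms are whatever the actual license text says.

```python
from enum import Enum


class License(Enum):
    """Hypothetical labels for the cases discussed above."""
    NONE_STATED = "no license stated"
    GFDL = "GNU Free Documentation License"
    CC_BY = "Creative Commons Attribution"
    PUBLIC_DOMAIN = "public domain"


def republish_terms(lic: License) -> str:
    """Summarize, very roughly, what republishing would require."""
    if lic is License.PUBLIC_DOMAIN:
        return "no restrictions"
    if lic is License.GFDL:
        return "allowed if the republished version is also GFDL-licensed"
    if lic is License.CC_BY:
        return "allowed with credit to the original site"
    # Default case: no license stated means the creators keep full
    # copyright control, so permission has to come first.
    return "ask the copyright holder for permission first"


print(republish_terms(License.NONE_STATED))
```

Note that the fallback branch is the restrictive one: anything that isn't explicitly permissive or public domain ends up at "ask first".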

EDIT: In the specific case of the Rails tutorial you cited, the author
helpfully spells out how the material is licensed, here:
[http://ruby.railstutorial.org/ruby-on-rails-tutorial-book](http://ruby.railstutorial.org/ruby-on-rails-tutorial-book)

> _Ruby on Rails Tutorial: Learn Web Development with Rails. Copyright © 2013
> by Michael Hartl. All source code in the Ruby on Rails Tutorial is available
> jointly under the MIT License and the Beerware License._

In other words: the _text of the book itself_ is copyrighted, with the
rightsholder being Michael Hartl. So you'd need his permission before you
could republish/remix/whatever it. The _source code of the coding examples_,
however, is published under the more permissive MIT and Beerware licenses. So
you could republish _that source code only_ without needing his formal
permission, as long as you do so in a way that complies with one of those
licenses.

~~~
cogdissneccgrit
Thank you for your helpful reply.

If I do decide to do this project, then I will make sure to credit him.

~~~
smacktoward
No problem... but based on your response I'm not sure you understood what I
was saying :D

He holds the copyright to the book's text, and didn't license it under a
permissive license, so _crediting him is not enough._ Copyright means you need
to get _his explicit permission_ for whatever it is you want to do. If he
wants to charge you money for that permission, or require you to follow some
set of conditions he specifies, it's in his rights to do that.

In other words, the next step would be to contact Hartl, describe your project
and ask him to give you an OK, preferably in writing. Without that OK you'd be
violating his copyright. Which isn't an academic matter -- if you violate his
copyright, it would be trivial for him to get your project taken offline by
your Web host just by filing a DMCA complaint. That's something he can do
with a simple email to the hosting company. So you want to be sure you're in
compliance to avoid pouring a lot of time and money into something that gets
nuked five minutes after you publish it.

