
Getting a full PDF from a DRM-encumbered online textbook - mr_tyzic
http://vgel.me/posts/cracking-online-textbook/
======
aftbit
I've done the same thing before on a book in college, except my DRM wasn't
nice enough to let me print 10 pages at a go. Instead, I spun up a virtual X
server with Xdummy with a resolution of 1200x10000. That showed a few dozen
pages at a time. Then I automated screenshots (scrot) and PageDown (xdotool).
Finally, some PIL magic to look for the thin gray line between pages plus
convert and ghostscript and I had a PDF!
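
The page-splitting step could be sketched roughly as follows. This is not the actual script: it assumes the tall screenshot has already been reduced to rows of 0-255 grayscale values (for instance via PIL's `Image.convert("L")`), and the helper names, the gray level of 128 and the tolerance are all invented for illustration.

```python
from typing import List, Sequence, Tuple

Row = Sequence[int]  # one scanline of 0-255 grayscale values


def find_separator_rows(rows: Sequence[Row], gray: int = 128, tol: int = 10) -> List[int]:
    """Return indices of rows where every pixel is close to the separator gray."""
    return [
        y for y, row in enumerate(rows)
        if len(row) > 0 and all(abs(p - gray) <= tol for p in row)
    ]


def split_pages(rows: Sequence[Row], gray: int = 128, tol: int = 10) -> List[Tuple[int, int]]:
    """Return (start, end) row ranges, end exclusive, for the pages between separators."""
    separators = set(find_separator_rows(rows, gray, tol))
    pages: List[Tuple[int, int]] = []
    start = None
    for y in range(len(rows)):
        if y in separators:
            # A separator row closes the page that is currently open, if any.
            if start is not None:
                pages.append((start, y))
                start = None
        elif start is None:
            # First non-separator row after a separator opens a new page.
            start = y
    if start is not None:
        pages.append((start, len(rows)))
    return pages
```

Each returned range could then be cropped out of the screenshot and handed to convert and ghostscript. A real separator may be more than one pixel tall, which this handles since consecutive separator rows simply keep the page closed.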

------
pdw
Watermarks and such are easy to remove with PDFtk, the Swiss Army knife of PDF
files. Convert the input files to a plain-text representation, find the code
that implements the watermark (it will be the only code that's identical on
every page), delete it and convert back. Easy as pie. It will also concatenate
the partial files.
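
To illustrate the "identical on every page" heuristic, here is a rough sketch. It assumes each page's content stream has already been pulled out of the uncompressed file as a string (that extraction is not shown), and `common_lines` and `strip_lines` are hypothetical helpers, not part of PDFtk. The uncompressed file itself can be produced with `pdftk in.pdf output out.pdf uncompress`.

```python
from typing import List, Set


def common_lines(pages: List[str]) -> Set[str]:
    """Return the non-blank lines that appear in every page's content stream."""
    if not pages:
        return set()
    line_sets = [
        {line.strip() for line in page.splitlines() if line.strip()}
        for page in pages
    ]
    return set.intersection(*line_sets)


def strip_lines(page: str, unwanted: Set[str]) -> str:
    """Drop every line of `page` whose stripped text is in `unwanted`."""
    return "\n".join(
        line for line in page.splitlines() if line.strip() not in unwanted
    )


# Made-up content streams standing in for two pages of a watermarked book.
pages = [
    "BT\n(Chapter 1) Tj\nET\n(Licensed to Bob) Tj",
    "BT\n(Chapter 2) Tj\nET\n(Licensed to Bob) Tj",
]
candidates = common_lines(pages)
```

Here `candidates` also contains generic operators such as `BT` and `ET`, which show up on every page of any PDF, so the list is a set of candidates to review by hand rather than something to delete blindly.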

~~~
semi-extrinsic
How does that preserve things like layout or equations?

I've done similar (though usually less effort) textbook trickery a few times.
The Adobe Inept hack is very handy. Oh, and a recent one was stupidly easy:
you could view the ebook in your browser, and save excerpts as a pdf, but only
100 pages in total per book. Problem was it stored how many pages you had
saved in a cookie, so "Clear the last 5 minutes of browsing history" and you
could get another 100 pages. Rinse and repeat for the whole book, then staple
the files together with pdftk.

~~~
function_seven
I think the "plain text representation" parent is referring to is the
PostScript that defines the page. If the equations are rendered in PS, or are
inline images, they might survive the roundtrip conversion?

~~~
pdw
Yes, it's a jumble of PostScript fragments, base64-encoded images and PDF
metadata. Everything that's needed to reconstruct the original PDF, but in a
form that's safe to edit in a text editor.

~~~
scintill76
You might want to compress the PDF again when you're done (my understanding is
that part of what makes PDFs non-plain-text is binary compression encodings
within the PDF container.)
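
For a sense of how much that compression matters: PDF content streams are commonly stored with the Flate filter, which is zlib/deflate. A small sketch using Python's standard `zlib` module on a made-up, repetitive content stream (the sample text is invented for illustration, not taken from any real PDF):

```python
import zlib

# A made-up, highly repetitive page description, standing in for an
# uncompressed PDF content stream.
stream = b"BT /F1 12 Tf 72 720 Td (Hello, textbook) Tj ET\n" * 200

compressed = zlib.compress(stream, 9)   # 9 = maximum compression level
restored = zlib.decompress(compressed)  # lossless round trip

print(len(stream), "bytes uncompressed")
print(len(compressed), "bytes after deflate")
```

With PDFtk, the recompression step is `pdftk edited.pdf output final.pdf compress`.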

------
gohrt
DRM-laden online books are a red flag that the college you are attending is a
thinly veiled profit center, not an education provider.

Has anyone made an index of which colleges require DRM textbook purchases in
their courses?

~~~
spectralblu
I encountered this in the lower-division general education courses at my
university. The most egregious one was the professor who required us to get
the current edition of the textbook (since he would be assigning problems out
of the new book, which was just a shuffled version of the previous version)
that he wrote himself. Fitting that it was an economics course.

Once I got into the CS courses, most if not all of my professors just provided
PDFs of either their own material or some open source textbook they were
contributing to.

------
Johnny_Brahms
I have this idea that any material that uses DRM should not be covered by
copyright, simply because DRM removes it from the pool of things that will
end up in the public domain.

For me, one of the wonderful things about copyright is that works always end
up available for free to the general public. A DRMed work will never be free
in that sense, and should then not be covered by the regular legal
protections.

~~~
lispit
As long as Disney is around, nothing will ever enter the public domain again.

~~~
Johnny_Brahms
TTIP and similar deals fuck it up for the rest of us as well.

I say a maximum of 25 years free copyright (I'd rather see something like 5-10
years), and then progressively increasing fees that start becoming crazy after
something like ten years.

Then use that money to finance culture.

------
DanBC
I'm glad he did it, and published the result. But isn't cracking DRM a
criminal offence? Is it wise to confess to it in a public forum?

~~~
Johnny555
Doesn't seem like he actually cracked any DRM - he downloaded the book (as he
was apparently entitled to do under his license, 10 pages at a time) and used
an image editor to remove the watermarks. The digital equivalent of printing
it and using whiteout to remove the watermark.

I think it would be hard to prove that he cracked any DRM.

~~~
GPGPU
Did you read the same article I did?

He didn't download the book 10 pages at a time, and he didn't use an image
editor to remove the watermarks.

He wrote a script that simulated navigating through the book with a mouse and
keyboard and a browser, and generated a bitmap image of every page.

------
sotojuan
As someone interested in Clojure, it's cool to see stuff like this built with
it. Luckily I haven't needed to purchase expensive books since freshman year.
Everything I've been required to get, I either rent on Amazon for $30 or find
online.

------
mixedmath
I notice he said the final product was around 700 megabytes, which is a bit
absurd. What could he have done to make the final size more reasonable?

~~~
a_bonobo
The PNGs are insanely high resolution, I think so that the OCR works better.
It doesn't say, but I assume the OCR'd book could use a lower resolution.
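
Some back-of-the-envelope arithmetic for why resolution dominates the size. The page dimensions and DPI values below are assumptions for illustration (the article doesn't give them), and the figures are for uncompressed bitmaps, so actual PNG sizes will be smaller:

```python
def raw_page_bytes(width_in: float, height_in: float, dpi: int,
                   bytes_per_pixel: int = 3) -> int:
    """Uncompressed size in bytes of one page rendered at `dpi` (RGB by default)."""
    return round(width_in * dpi) * round(height_in * dpi) * bytes_per_pixel


# Assumed US Letter page at two candidate resolutions.
high = raw_page_bytes(8.5, 11, 300)
low = raw_page_bytes(8.5, 11, 150)

print(high, "bytes per page at 300 dpi")
print(low, "bytes per page at 150 dpi")
print(high / low, "times larger")
```

Halving the resolution quarters the pixel count, and storing pages as 1-byte grayscale instead of 3-byte RGB would cut the raw size by another factor of three.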

