Harvard Law Library sacrifices trove of legal volumes to digitize them (nytimes.com)
70 points by wellokthen on Oct 29, 2015 | hide | past | favorite | 31 comments



I work at the Harvard Library Innovation Lab with the folks who are making this happen. Super excited that it's finally public. If anyone has questions I'm happy to dig up answers.

Here's how big this is: we don't even know yet how many cases we'll end up with, to within the nearest million.

PSA: we're hiring a devops engineer[1]. In addition to building amazing tools to access all this data, we're running a distributed linkrot preservation service[2] that after just two years is in use by 40% of American law schools and 10% of state supreme courts; an open-textbook-as-forkable-playlist[3] tool in use at Harvard Law and a half dozen other law schools; and a research project on distributed encrypted library archives[4] for preserving high-value cultural records. We're basically the alien in the brain of a 200-year-old library -- it's a fun place to work.

[1] http://librarylab.law.harvard.edu/blog/2015/10/20/hiring-dev... [2] http://perma.cc [3] http://librarylab.law.harvard.edu/projects/h2o [4] http://librarylab.law.harvard.edu/projects/time-capsule-encr...


Thanks for stopping by! Do you happen to know what type of license the materials will be under when they open up access?

I am an English teacher, and about 80% of my students are professional lawyers, so I wonder how free I will be to use this material in my classes. I already use Harvard Law School's free case studies -- they're great, and under a CC license if I remember correctly.


Bottom line: everything becomes public domain after eight years. Before then, you'll be able to search/view/download up to 500 cases per day through either a web interface or API. As far as I know there's no licensing on the individual cases you download.

For academic researchers, before the eight years are up, we can also provide a full data dump -- you just have to sign an agreement not to redistribute bulk data.
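The 500-cases-per-day cap is something a downloader would want to track client-side too. Here's a minimal sketch of such a tracker -- the class name and interface are my own invention for illustration, not part of any announced API:

```python
class QuotaTracker:
    """Track downloads against a daily cap (500 cases/day per the
    announcement). Purely illustrative -- not an official client."""

    def __init__(self, daily_limit=500):
        self.daily_limit = daily_limit
        self.used = 0

    def allow(self, n=1):
        """Return True and record the usage if n more downloads fit
        within today's limit; otherwise return False."""
        if self.used + n > self.daily_limit:
            return False
        self.used += n
        return True
```

A real client would also reset the counter at day boundaries and handle whatever error the server returns when the cap is hit.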


Is anyone considering whether there is metadata that may somehow be lost through this process and that should be captured through some sort of coding process? I don't have an answer, but I am essentially wondering if there is a deliberate effort to circle back to the question "Are we missing anything?"


So we're storing uncompressed 300dpi color scans of every page, as well as the original shrink-wrapped books out in a salt mine somewhere -- we're not losing any data.

There's a whole separate problem of turning those scans into a high-quality data set. The first pass will be decent-but-not-perfect OCR of the full text (with page-location data, like Google Books), plus human-checked metadata for stuff like case name, judge, and date. Since we have the original scans as well, there's lots of room to iteratively improve the data conversion from there via ReCAPTCHA and the like.
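One simple way to represent that page-location data is a map from character offsets in the OCR'd text back to word bounding boxes on the scan. This is an illustrative sketch, not their actual pipeline; the `(word, x, y, width, height)` input format is an assumption modeled on the word-level output that OCR engines like Tesseract produce:

```python
def build_page_index(ocr_words):
    """ocr_words: list of (word, x, y, width, height) tuples for one page.

    Returns (full_text, index) where full_text is the searchable page
    text and index maps each word's character offset in full_text back
    to its bounding box on the page image.
    """
    parts = []
    index = {}
    offset = 0
    for word, x, y, w, h in ocr_words:
        index[offset] = (x, y, w, h)
        parts.append(word)
        offset += len(word) + 1  # +1 for the joining space
    return " ".join(parts), index
```

With a structure like this, a full-text search hit can be highlighted directly on the original scan, which is also what makes iterative human correction (a la ReCAPTCHA) practical.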


"You can imagine the way your heart skips a small beat when you put a book under a chopper like that," he said. After the volumes are scanned, workers reattach the spine to the pages, encase the book in shrink-wrap and, he said, "put it back in the depository for the apocalypse."


Vernor Vinge's Rainbows End has a massive book-scanning project which consists of dropping books into an industrial shredder and then blowing the shreds through a long tube lined with cameras. The little fragments of imagery are then stitched together using sufficiently advanced software.

This isn't there yet, but I wonder if that's where we're heading...


Vinge's book scanning reminded me a lot of DNA shotgun sequencing(1). I wonder if that's what he modeled his method on?

(1) https://en.wikipedia.org/wiki/Shotgun_sequencing
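The analogy holds: both problems reduce to stitching a whole out of overlapping fragments. Here's a toy greedy assembler just to illustrate the overlap-and-merge idea -- real sequence assemblers (and Vinge's fictional scanner) handle read errors, repeats, and coverage statistics that this ignores:

```python
def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def assemble(fragments):
    """Greedily merge the pair of fragments with the largest overlap
    until a single reconstructed string remains."""
    frags = list(fragments)
    while len(frags) > 1:
        best = (0, 0, 1)  # (overlap length, i, j)
        for i in range(len(frags)):
            for j in range(len(frags)):
                if i != j:
                    n = overlap(frags[i], frags[j])
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)]
        frags.append(merged)
    return frags[0]
```

Greedy overlap assembly breaks down on repetitive text, which is exactly why shotgun sequencing needs deep coverage -- and presumably why Vinge's shredder needed those unique tear shapes.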


Too inefficient. Opening up the books and slicing off the pages cleanly would be more effective, and we already have age-old matching technology from bill counters and card sorters.


Why do you call it inefficient? They were scanning a book every few seconds! Cutting the spines off and running the pages through a sheet feeder, on the other hand, requires sustained human interaction for every book.


> They were scanning a book every few seconds

What's the hurry?


IIRC that was part of the plot too. There was no strong reason to prefer this technology, except that the company that owned the destructive-scanning IP was pushing their solution before the other ones matured enough to be cost-effective.


Not speed-wise. Complexity-wise. And there's a real risk of mismatched pieces. They'd all have to have predictably unique shapes otherwise.


In the book, one character explains that tearing rather than cutting the books is specifically to get unpredictable unique shapes along the tears, to make matching for reconstruction possible...

There will be some loss, true. Even where everything is properly photoed, the programs will make some mismatches. Potentially, the error rate can be less than a few words per million volumes, far better than even hardcopy republishing with manual copyediting. -- Sharif, Rainbows End p 129

But of course this is fiction. The parallels to what Harvard is doing are obvious, but I don't think anyone would seriously suggest building the Libreanome project in the real world.


Harvard Law Library sacrifices trove of legal volumes to digitize them

Please don't editorialise submission titles [1]. The title of the NYT article is "Harvard Law Library Readies Trove of Decisions for Digital Age".

[1] FWIW, I agree with the sentiment of loss implied by "sacrifices". On a worse day than this I might even think of these books as being mutilated.


Eh, it's an overly sentimental way to look at things. On the surface it sounds like these books are being damaged, but in reality they're being given new life. Before, you'd have to be physically present at Harvard to access these tomes; after digitization they'll be available and searchable by anyone around the world. People might actually stumble upon the information via a related search instead of having to know of the existence of the specific document. This also dramatically extends their lifespan, because they'll be spread digitally instead of being at the mercy of one fire or flood.


The URL says "Harvard Law Library sacrifices a trove for the sake of a free database", and the title of the print article is "Sacrificing a Legal Trove for the Digital Age", so perhaps the submitter did not editorialize the submission title. Maybe the online title has just changed since submission.


I'm curious as to why they need to destroy the books? Remember Google's open book scanner?

http://linearbookscanner.org/

https://code.google.com/p/linear-book-scanner/


Presumably they found it more important to preserve the pages than to preserve the volume binding. The FAQ for the linear book scanner indicates that Prototype 1 mangles pages in some way (tears or folds) in about 45% of books it scans.

Moreover, it used to be common to create bound volumes by binding multiple issues of a serial together. (In some cases the bindery would crop pages to fit!) The binding is often so tight that the volume cannot open flat enough for a full scan. Separating the pages from the spine allows for the entire page to be imaged without distortion.

Edit: Incorporated correction. I had accidentally stated the claim more strongly than the FAQ supported; however, my point does not depend on the claim being as strong as the form in which I had stated it.


Correction: it mangles one or two pages in 45% of books it scans.

    Prototype 1 could scan the majority of books without 
    damage, but may tear one or two pages in some books. Out 
    of 50 books tested, 45% had one or two of their pages 
    either torn or folded. This is a very early prototype and 
    there are many areas for improvement in the design.
http://linearbookscanner.org/faq/


Incidentally, 45% of 50 is not an integer. This makes me wonder how many books were actually tested. Or were they just rounding up from 44%?


It sounds like that machine gets one page every 11 seconds -- maybe 360 per hour? We're getting more like 8,000 per hour.

The rule of thumb apparently is, best case, books you can't cut the binding off cost 5-8x as much to scan as books you can.
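Spelled out, the throughput gap is large. The 11-seconds-per-page and 8,000-pages-per-hour figures come from this thread; the rest is simple arithmetic:

```python
def pages_per_hour(seconds_per_page):
    """Convert a per-page scan time into an hourly rate."""
    return 3600 / seconds_per_page

linear_scanner = pages_per_hour(11)      # ~327 pages/hour
harvard_rate = 8000                      # pages/hour, per the comment
speedup = harvard_rate / linear_scanner  # roughly 24x faster
```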


Probably because the books are far less important than the information.

That being said, they are probably throwing away an opportunity to sell the books as novelty or decorative items. But who knows, maybe the cost-benefit is just not worth it.


If you want to search court cases today for free, check out https://www.courtlistener.com , part of http://freelawproject.org


Those guys do really good work. It would be nice if this effort (the Harvard one) coordinated with the Free Law Project.


Rather than sacrificing the volumes, they are sacrificing the bindings. The re-bound pages themselves will likely be much better preserved in archival storage than they would be in human-accessible, human-handled form on the shelves.


I'm wondering about the actual value of old cases. Surely there's a half-life for law. This worship of precedent as semi-holy writ is strange to me. As an engineer, I find old books far less useful than new ones. Things change; people change; law changes too. Maybe it's just fine that the books were mutilated for this process. It's a way of 'burning bridges' and putting it all behind us.


Old cases are valid so long as the laws haven't changed, there hasn't been a new ruling that overrides them, and the situation is similar. Judges aren't completely bound by precedent either: if the situation means it shouldn't apply, they're able to rule differently and explain why in their judgments. The concept of precedent is really important in law because of all the vagaries and rules surrounding laws and how they're applied; it also supports the even application of the law, so that laws and their effects are more predictable.

It also makes cases easier to argue by mapping out the corner cases of the law and legal concepts, and it keeps every lawyer from having to be an absolute expert in every facet of every case they're litigating. The same applies to judges.

Precedents do have a limited lifetime, but it's not measured in years; if someone is being tried under the same law as existed 100 years ago, old rulings can be just as relevant as one made last year.


> As an Engineer, old books are far less useful than new ones.

This isn't always true. Many of the fundamentals of civil engineering, mechanical engineering, mathematics, power engineering, chemistry, physics, even computer science, are not supplanted by more recent texts. In fact, when I was a computer science and math student, I lamented many of the modern texts (math and physics ones being the most egregious) for their overly fluffy delivery, compared to my father's and grandfather's and mother's math textbooks from (at the time) 20 and 60 years earlier. I would use theirs to study because they were much better at conveying the topics than the recent calculus or statistics or whatever book we were given.

In industry, though, and in the software industry in particular, your point applies more. So much is changing so fast. The basic concepts (functions, types, network theory, latency versus throughput, even multi-process computing) are the same, but the technologies we use, and how they present or make use of those fundamentals, change rapidly -- so my 199x book on JavaScript is only barely applicable to what we have today.


You would be surprised at how relevant some of the old case law is, mainly because it goes straight to the heart of certain issues -- i.e., what does it actually mean to 'own' something? These fundamentals don't change, though of course this modern world does put stress on some of these principles. For example: have I really stolen from you if you still have the item? (Digital copying.)


I have the good fortune to have worked in the same office as Alex Gulotta my senior year at an internship. This was before he left to work at the Bay Area Legal Aid but he was a huge presence and boost for the office in Charlottesville.



