Harvard Converts Millions of Legal Documents into Open Data (govtech.com)
331 points by profquail 10 days ago

Hi! I'm a project dev. In case it's helpful, there's lots of Q&A in this previous thread: https://news.ycombinator.com/item?id=18330876

I'll try to answer any questions that pop up here as well.

Which OCR did you use?

ABBYY FineReader -- I don't have the specific version in front of me unfortunately.

In addition to the structured text we're currently serving through the API, we also have 300DPI color scans and per-word coordinates and confidence scores, so there's a lot more we can do with the OCR data that isn't exposed yet.

For those docs that require more finesse (This is OCR :D), was there a massive manual effort involved?

And Aaron Swartz is dead. Not sure how many of the docs made public are ones that he was trying to hack into. That incident will always leave a bitter taste in my mouth.

None. He was distributing copyrighted articles from JSTOR. These are court cases which legally cannot be copyrighted (with any copyrighted annotations in the paper copies properly redacted). The two situations have nothing in common.

Heh. And yet case.law has a 500-case download limit, except for those approved under the research agreements mandated by LexisNexis. I've been waiting since release night to get a research license approved. Granted, it's been less than a week.

What do you think would happen if somebody with a research agreement downloaded everything and released it all to the public? You'd probably then find that the two situations are very similar.

To be clear, the download limit is imposed by Ravel, the startup that paid for the effort of redacting copyrighted annotations from the scanned images. Lexis later purchased Ravel, but the download limit comes from Ravel’s control over the redacted digitized images, not Lexis’s copyright over the source materials.

Now, what would happen if you breached the research agreement? You’d probably be sued for breach of contract. But that’s not what case.law is doing. It’s abiding by everyone’s legal rights: the original publishers who collected, archived, indexed, and added annotations to the case law, and the folks who helped digitize versions that could be freely distributed. The government paid for the courts so the cases are free, but it didn’t pay for all those other things and they aren’t free.

That's pretty tautological, isn't it?

> If someone did a very similar thing here to what Swartz did with JSTOR, this situation would then be very similar to what Swartz did with JSTOR.

No, the question is asking whether a similar act would result in a similar outcome; whether there is a crucial difference in how the two perceptibly similar things would interact with law.

Analogously, one could ask what would happen if someone avoided their taxes vs. evaded their taxes. Both could be seen as morally the same act, but the legal consequences are different.

Swartz did work with PACER before the JSTOR incident. His first incident with an FBI agent was because of the work he did with PACER.

The FBI investigated and then did nothing, which was the right course of action. PACER is a government database intended mainly for litigants and the courts. It is essential to the functioning of the court system,[1] and it was proper for the FBI to investigate any unusual access. It was also proper when the FBI declined to press any charges after they concluded that all he was doing was distributing the documents, which were public and not subject to copyright.

[1] PACER is literally a read-only view into the same databases courts and lawyers use to file documents and orders in cases. Some people want a mass-publishing system for court documents, and maybe we should build such a thing. But calls to abuse PACER for that purpose are just an end-run around the political challenges of getting the government to spend public money building such a system.

> Some people want a mass-publishing system for court documents, and maybe we should build such a thing.

If I understand you correctly, there is such a thing. https://www.courtlistener.com/recap/ is a public archive populated through browser plugins run by paying PACER users.

RECAP is a hack and depends on someone having accessed the document in the first instance. It's not true public access.

The issue is that PACER is designed primarily for attorneys. That's why the usage fees are so high--it's basically a tax on attorneys that goes to funding the operations of the courts. (Pro se individuals are entitled to receive filings in their cases for free.)

The open access folks have a legitimate point that PACER makes it hard for the public to access those same documents. But the solution to that isn't to abuse PACER. If we think everyone should have free access to these documents,[1] the solution is to build a website where these things are published. And, since that would undercut the value of PACER, arrangements would have to be made to replace that revenue with general appropriations.

[1] Note the reason we would want to do this is that these are public records, not because they constitute "the law." Court opinions with precedential value are already published on courts' websites in PDF format. What PACER contains is everything else.

From what I understood, he only downloaded docs he had access to but did not distribute them.

> And Aaron Swartz is dead. Not sure how many of the docs made public are ones that he was trying to hack into. That incident will always leave a bitter taste in my mouth.

Makes me wonder how much overreach goes on and we don't even know about it...

Ortiz and Heymann are definitely horrible people in my eyes but we have to think of them as responding to incentives. They saw they had an opportunity to pad their numbers and went for it. I don't think we have done anything to fix the core issue, which I think is how do we judge the performance of a prosecutor?

I mean, isn't morality all about doing the right thing in the face of perverse incentives? If status gains can be used to excuse anything we do, where will it all end?

With better incentives hopefully. Nature has no care for morality.

If you find a prosecutor in nature, you should contact Sir David Attenborough. That would have the makings of an extraordinary documentary!

Humans do. Humans are the only species with the capability to transcend their nature.

But do you think they have this ability... by nature? Which makes it human nature? Which is thus not transcended? This can be an argument that having an innate sense of good vs. bad (if that's a thing) can serve a species-level incentive rather than an individual one. Which we share with many species, come to think of it.

None, probably. The effort here seems to involve mostly documents previously only available on paper.

This is very exciting! I'm looking forward to digging into this info.

I'll read a bit more on the site, but offhand does anyone know if this is an ongoing effort, in that new (2019 and beyond) cases will be brought in as well?

For now we're stopping with volumes published up through June 2018. The Free Law Project (https://free.law) is a good source for cases published after that, although they're mostly unofficial versions scraped from court websites.

We're hopeful that all courts will switch to official digital-first publishing over the next few years, as a few courts already have. Once the transition is complete, it might make sense for us to go back and fill in the gap volumes.

FWIW: We offered many years ago to pay PACER and various states to do precisely that (move to digital publishing, and make it all open).

They were all unwilling. PACER's position is at least somewhat reasonable, since we did not offer to pay them the 145 million a year they were making at the time, and they felt Congress would kill them if they gave up that revenue source, which is probably not wrong. (For the others, what we offered was much more than they were making.)

So I'm not sure why you have such hope.

Of the states that have moved to digital-first publishing, just about all have struck agreements with lexis/etc whereby they have token "free access" sites and the data is still otherwise locked up.

> I'm not sure why you have such hope.

Hey, I work at the Library Innovation Lab -- being hopeful about open access scenarios is just one of the services I provide. :)

But basically I'm hopeful in this instance because (a) there's less and less incentive for commercial publishers to try to control this particular low-bandwidth stream of public domain text; (b) there are more and more platforms that would benefit from a standard open feed; and (c) the courts have had a lot more time to think about it (in the grand scheme of how old courts are vs. how old the internet is) and see other courts try it first.

On that last point, we have a dozen or so state supreme courts already using another service of ours, Perma.cc, so we do have some idea how they think about adopting new technology.

Would love to chat more with you or anyone else who's been thinking about how to crack this -- your experience sounds really interesting, and "hopeful" doesn't mean I think it'll be easy. Contact info is in my profile.

There's a very long history to the effort to digitize federal and state case law. West, once a private company, sued Lexis/Nexis claiming that their page numbers were copyrighted. West lost, and perhaps seeing the writing on the wall, sold itself. The copyrighted material added by their editors consisted of summaries of the cases themselves. But wouldn't these be derivative works, given that most of the time they merely copied key text from the opinion, as opposed to creating an original work? With digitization, the key number system, a valuable finding aid in the analog days, became unnecessary.

Whatever happened to the Taxpayers' Assets Program? Or the Department of Justice's Juris system? Perhaps Juris was obtained under license from West or Lexis, but the Air Force had their own system, called FLITE (Federal Legal Information through Electronics). The Assets Program tried to get access to FLITE but could never get the Air Force to push the button.

Remember that the raw material for all of these cases is the opinions written by federal employees; i.e., federal judges. The opinions themselves cannot be copyrighted. A copyright wrapper around them was permitted. I'm surprised that only three states have made their records available under this program: in Florida it was possible to sign up for the feed and receive a zip file of all "published" (that is, filed) cases every week. More than a few private companies jumped in. It's now possible to get access to one library or another at a fraction of the 1980s cost.

The deep dark secret of all of this is that, except for a few clarion cases, you only have to pay attention to the last 400 volumes or so; maybe only the last two hundred volumes of the Federal Reporter, Third Series. The U.S. Reports (not the Supreme Court Reporter, which is inside the copyright wrapper along with the Lawyers' Edition) contains no copyrighted material whatsoever. The basic problem is that there is simply too much law. It is difficult to keep up with the law in a single jurisdiction, and impossible to keep up with the case law written in the whole country.

What applications can be built with this? Any attorneys here who can comment?

Someone can figure out how to use this data set to train an AI lawyer.

Does it mean it's all in the public domain?

