
Harvard Converts Millions of Legal Documents into Open Data - profquail
http://www.govtech.com/analytics/Harvard-Converts-Millions-of-Legal-Documents-into-Open-Data.html
======
JackC
Hi! I'm a project dev. In case it's helpful, there's lots of Q&A in this
previous thread:
[https://news.ycombinator.com/item?id=18330876](https://news.ycombinator.com/item?id=18330876)

I'll try to answer any questions that pop up here as well.

~~~
ocrcustomserver
Which OCR did you use?

~~~
JackC
ABBYY FineReader -- I don't have the specific version in front of me
unfortunately.

In addition to the structured text we're currently serving through the API, we
also have 300DPI color scans and per-word coordinates and confidence scores,
so there's a lot more we can do with the OCR data that isn't exposed yet.

~~~
just_myles
For those docs that require more finesse (This is OCR :D), was there a massive
manual effort involved?

------
honestlyidk
And Aaron Swartz is dead. Not sure how many of the docs made public are ones
that he was trying to hack into. That incident will always leave a bitter
taste in my mouth.

~~~
rayiner
None. He was distributing copyrighted articles from JSTOR. These are court
cases which legally cannot be copyrighted (with any copyrighted annotations in
the paper copies properly redacted). The two situations have nothing in
common.

~~~
bpchaps
Heh. And yet case.law has a 500-case download limit, except for those who have
been approved under research agreements that are mandated by LexisNexis. I've been
waiting since release night to get a research license approved. Granted, it's
been less than a week.

What do you think would happen if somebody with a research agreement
downloaded everything and released it all to the public? You'd probably then
find that the two situations are very similar.

~~~
pc86
That's pretty tautological, isn't it?

> If someone did a very similar thing here to what Swartz did with JSTOR, this
> situation would then be very similar to what Swartz did with JSTOR.

~~~
gnode
No, the question is asking whether a similar act would result in a similar
outcome; whether there is a crucial difference in how the two perceptibly
similar things would interact with law.

Analogously, one could ask what would happen if someone avoided their taxes
vs. evaded their taxes. Both could be seen as morally the same act, but the
legal consequences are different.

------
tomrod
This is very exciting! I'm looking forward to digging into this info.

I'll read a bit more on the site, but offhand does anyone know if this is an
ongoing effort, in that new (2019 and beyond) cases will be brought in as
well?

~~~
JackC
For now we're stopping with volumes published up through June 2018. The Free
Law Project ([https://free.law](https://free.law)) is a good source for cases
published after that, although they're mostly unofficial versions scraped from
court websites.

We're hopeful that all courts will switch to official digital-first publishing
over the next few years, as a few courts already have. Once the transition is
complete, it might make sense for us to go back and fill in the gap volumes.

~~~
DannyBee
FWIW: We offered many years ago to pay PACER and various states to do
precisely that (move to digital publishing, and make it all open).

They were all unwilling. PACER is at least somewhat reasonable, since we did
not offer to pay them the 145 million a year they were making at the time, and
they felt Congress would kill them if they gave up that revenue source, which
is probably not wrong. (For the others, what we offered was much more than
they were making.)

So I'm not sure why you have such hope.

For all states that _have_ moved to digital first publishing, just about all of
them have struck agreements with lexis/etc whereby they have token "free
access" sites and the data is still otherwise locked up.

~~~
JackC
> I'm not sure why you have such hope.

Hey, I work at the Library Innovation Lab -- being hopeful about open access
scenarios is just one of the services I provide. :)

But basically I'm hopeful in this instance because (a) there's less and less
incentive for commercial publishers to try to control this particular low-
bandwidth stream of public domain text; (b) there's more and more platforms
that would benefit from a standard open feed; and (c) the courts have had a
lot more time to think about it (in the grand scheme of how old courts are vs.
how old the internet is) and see other courts try it first.

On that last point, we have a dozen or so state supreme courts already using
another service of ours, Perma.cc, so we do have some idea how they think
about adopting new technology.

Would love to chat more with you or anyone else who's been thinking about how
to crack this -- your experience sounds really interesting, and "hopeful"
doesn't mean I think it'll be easy. Contact info is in my profile.

------
pseingatl
There's a very long history to the effort to digitize federal and state case
law. West, once a private company, sued Lexis/Nexis claiming that their page
numbers were copyrighted. West lost, and perhaps seeing the writing on the
wall, sold itself. The copyrighted material added by their editors was
summaries of the cases themselves. But wouldn't these be derivative works,
given that most of the time they merely copied key text from the opinion, as
opposed to creating an original work? With digitization, the key number
system, a valuable finding aid in the analog days, became unnecessary. Whatever
happened to the Taxpayers' Assets Program? Or the Department of Justice's
Juris system? Perhaps Juris was obtained under license from West or Lexis, but
the Air Force had their own system, called FLITE (Federal Legal Information
through Electronics). The Assets Program tried to get access to FLITE but
could never get the Air Force to push the button. Remember that the raw
material for all of these cases is the opinions written by federal employees;
i.e., federal judges. The opinions themselves cannot be copyrighted. A
copyright wrapper around them was permitted. I'm surprised that only three
states have made their records available under this program: in Florida it was
possible to sign up for the feed and receive a zip file of all
"published" (that is, filed) cases every week. More than a few private
companies jumped in. It's now possible to get access to one library or another
at a fraction of the 1980s cost. The deep dark secret of all of this is that,
except for a few clarion cases, you only have to pay attention to the last 400
volumes or so; maybe only the last two hundred volumes of the Federal
Reporter, Third Series. The U.S. Reports (not the Supreme Court Reporter,
which is inside the copyright wrapper along with the Lawyer's Edition)
contains no copyrighted material whatsoever. The basic problem is that there
is simply too much law. It is difficult to keep up with the law in a single
jurisdiction, and impossible to keep up with the case law written in the whole
country.

------
akudha
What applications can be built with this? Any attorneys here who can comment?

------
m3kw9
Someone could figure out how to use this data set to train an AI lawyer.

------
rambojazz
Does it mean it's all in the public domain?

