The Cambridge Law Corpus: A corpus for legal AI research (arxiv.org)
142 points by belter on Sept 23, 2023 | 41 comments



Good luck doing that in the United States. Pretty much all of that kind of data, while supposedly available under FOIA, is paywalled by the government or third parties. But it would be interesting to see an AI model.


Harvard Law's Library Innovation Lab has the Caselaw Access Project, a complete collection of precedential US caselaw with structured metadata. It's at https://case.law. It's readable online (including the original PDFs for most), accessible via a REST API in fully structured documents, and available through bulk downloads. There are OCR mistakes here and there, but the accuracy is over 99% even with weird things like the "long s" that looked like an f and was common before the 20th century. While updates are too slow to replace the commercial tools, it's perfect for uses like this... Rather, it will be in a little over 3 months. In Feb of 2018, it was released under an agreement with a funder to limit access to 500 cases per day per user for 6 years, except for a few jurisdictions available now, so the entire corpus will be completely open then.

They scanned, OCR'd, and applied metadata to 40k volumes, and (digitally) redacted by hand all commercial material (e.g. headnotes, key citations) in all in-copyright volumes, so what's left is entirely in the public domain.

Disclosure: worked on that project for several years.
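
For anyone scripting against the throttled jurisdictions in the meantime, here is a minimal sketch of a client-side counter that stays under a per-day cap such as the 500 cases per day per user mentioned above. The `DailyQuota` class and its method names are hypothetical and not part of any case.law client, and it assumes the allowance resets on a calendar-day boundary, which the comment above does not specify.

```python
from datetime import date
from typing import Optional


class DailyQuota:
    """Hypothetical client-side counter for a per-day download cap."""

    def __init__(self, limit: int = 500) -> None:
        # 500 is the per-user daily cap mentioned in the comment above.
        self.limit = limit
        self._day: Optional[date] = None
        self._used = 0

    def try_consume(self, today: date, n: int = 1) -> bool:
        """Reserve n downloads for `today`; return False if that would exceed the cap."""
        if today != self._day:
            # Assumption: the allowance resets when the calendar day changes.
            self._day = today
            self._used = 0
        if self._used + n > self.limit:
            return False
        self._used += n
        return True

    def remaining(self, today: date) -> int:
        """Downloads still available for `today`."""
        if today != self._day:
            return self.limit
        return self.limit - self._used
```

A script would call `try_consume(date.today())` before each request and stop for the day once it returns `False`. The server's own accounting is authoritative; this only avoids sending requests that are expected to be refused.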


Thanks Andy. We're also in the midst of releasing the Collaborative Open Legal Dataset ("COLD cases") specifically aimed at AI/ML work. More tk. https://huggingface.co/datasets/harvard-lil/cold-cases


Any idea how complete the COLD dataset is compared to the Caselaw Access Project?

Unfortunately Caselaw limits access to the full text bulk data of most jurisdictions without a research account and I’m trying to find an alternative.


COLD is bigger than Caselaw Access Project. It's over 8 million cases vs 6.9 million on case.law.

Please let me know how you find using the data and if you'd like to see any additions or changes!


> Please let me know how you find using the data and if you'd like to see any additions or changes!

I only just started yesterday but so far so good!

Having it in Parquet files on Git LFS makes a huge difference. It only took a few lines to add the entire dataset to our CI/CD cache, which is an improvement over the ingestion scripts we normally have to write, with change detection and all that. It took less than an hour to start running the cases through our pipeline - I wish all of the GovInfo bulk data were available this way!
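
To illustrate the kind of change detection an ingestion script usually needs when the source does not provide it, here is a minimal sketch using only the Python standard library. It hashes each data file and compares the result against a saved manifest. The function names, the `*.parquet` default pattern, and the JSON manifest layout are all assumptions made for illustration, not anything taken from the COLD dataset or from our actual pipeline.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def changed_files(data_dir: Path, manifest_path: Path,
                  pattern: str = "*.parquet") -> list[Path]:
    """Return files that are new or modified since the last run, then update the manifest."""
    # The manifest maps relative file paths to the digest seen on the previous run.
    old = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    new = {}
    changed = []
    for path in sorted(data_dir.rglob(pattern)):
        key = str(path.relative_to(data_dir))
        new[key] = file_digest(path)
        if old.get(key) != new[key]:
            changed.append(path)
    # Overwrite the manifest so the next run compares against this state.
    manifest_path.write_text(json.dumps(new, indent=2, sort_keys=True))
    return changed
```

On the first run every matching file is reported; later runs report only files whose bytes changed. Git LFS pointer files already record a SHA-256 object ID for each tracked file, which is a large part of why this bookkeeping is not needed when the data ships that way.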


I'm really glad you appreciate that!


That's rad. I'll have to dig into it.


> Unfortunately Caselaw limits access to the full text bulk data of most jurisdictions without a research account

That's only true until Feb of 2024! It should be totally unrestricted in a few months.


Great to know, thank you!

Any idea why they're time limited? I assumed that was a license restriction from reporters or Fastcase, et al. which would have been permanent.


To be clear, they are all in the public domain— Fastcase updates included. All of the proprietary info was redacted by hand and the opinions themselves are not copyrightable. The throttling is a contractual obligation to a project partner that limits Harvard's distribution of the cases until Feb of 2024, but that's it. There are also exceptions— cases where the publication is no longer in copyright, and jurisdictions that already publish their opinions online... There are 3 or 4. Those are accessible without throttling through the API and through bulk downloads right now.

This should have more up-to-date and accurate information than I do: https://case.law/about


Rad!


> a complete collection of precedential US caselaw with structured metadata

is it complete? for example it says it is 144k cases in California, I would expect more..


From what I've been told by people who know way more about it than I do, it is complete. One thing to consider is that official published precedential caselaw is from the appellate level up. From what I understand, lower court cases aren't published, though I believe they can be accessed through PACER... But I have no real legal research expertise. The first part of the project entailed a lengthy process involving several decades-experienced legal librarians, lawyers, and archivists mapping out exactly what reporters exist and which ones were "official" (considered authoritative by the courts) and when. Apparently nobody had done it before. Well, nobody that made their data available, anyway. HLSL is a library of last resort for law and all but a tiny fraction of the books we scanned are in their collection.

In the about page there's more detailed information about the scope and process.


this makes sense, thank you!


This one isn't even paywalled. Won't give you access even if you have money.

> Applications for access to the Cambridge Law Corpus (CLC) can only be made by researchers who are employed full-time by a recognised university or other research institution. The applicant must hold a permanent position at the level of Assistant Professor (or higher) or equivalent.


Not even adjuncts and postdocs, brutal.


So corporate access will be quickly sorted. Maybe a 1 million dollar research grant by a GAFAM corporation to a promising and enterprising Computer Science Department...


You can get all the patent data from the USPTO for free:

https://github.com/lettergram/parse-uspto-xml/tree/master


Note for U.S. link openers: this is U.K. based case data only.


Why is this comment being downvoted? If it's not true it would be nice to explicitly say so.


"We include an extensive legal and ethical discussion to address the potentially sensitive nature of this material. As a consequence, the corpus will only be released for research purposes under certain restrictions."

So no, you will never see this unless you are a researcher. Academic gatekeeping strikes again.


It's amazing how one-sided the public nature of prosecutions is, in the sense that they are entitled to everything that is advantageous to the government/harmful to the defendant in the name of the public, but they can't quite extend that to ensuring the public has the necessary tools to understand, interpret and successfully operate within the domain of, and sometimes (God forbid) "against" (in an adversarial, not contravening sense) the law itself by playing within its rules to their own benefit or advantage, as they should arguably be fully entitled and empowered to.

Assuming it's consistent with fundamental justice and precedent (which shouldn't really necessarily require an advanced legal education to competently analyse and resynthesize into a case that integrates the facts of the case in question), we should value the ability of the citizenry and all litigants in general to quickly and easily present an informed meritorious case and not rely on obfuscation or access issues rooted in technology and deficiencies of any given education system to limit participants' capabilities unduly.


Maybe it's for the best.

Once you feed an LLM all case law, answers to questions like "...so how can I kill my family, collect life insurance on them and get away with it" become trivial for the plebes to access.

As cool as tech advances are, you really don't want everyone unlocking the secrets of estate law or nuclear fusion. Some gatekeeping is necessary.


I suspect the LLM answer would be "There is no loophole that would allow you to kill your family for the life insurance money, but if you're crafty enough, never tell a soul, and keep the secret to your grave, then there's a half chance the case goes cold until your old age, wherein your fifth cousin once removed takes a DNA test that incriminates you, which is good enough."

Sometimes unlocking the secrets of something isn't enough - you need the underlying capital or power to actually pull it off. Normal individuals with normal incomes can't reliably build fusion reactors or pull off insurance fraud schemes no matter how much compute or intelligence they have. If they could, we wouldn't be living in a statist society, we'd be living in 2b2t.


You really think a lawyer could answer that? There isn't any secret sauce to how to murder people legally in the case law.


Case law isn't going to have people who away with it, there's a selection bias there.


Mostly this is correct but it happens. There are always outliers and edge cases and it's good for the system to have them so that they can be mitigated by iteratively improving the laws based on how they "won" and "got away with it". The noose is always tightening.


* who got away with it…


There really is no privacy is there?


All court records, filings, judgements, laws, and reviews are public records unless sealed by the courts. It's a matter of whether they have been digitized or not.


How easy it is to access data is a dimension of privacy, too. When Facebook introduced the Timeline, it made it easy to jump to a year in the past. Before, you had to laboriously scroll back through all recent posts. Making my old embarrassing high school photos easy to look up affects me!

It’s one thing for someone to have to dig up records through obscure legal websites. It’s another for it to show up on the front page of your Google search listing.


Aren't legal proceedings generally public information?


Then why this:

“potentially sensitive nature of this material”.


It's a red herring, unless they're saying it's sensitive or unethical to potentially put law professionals out of work.


Information can be both public and potentially sensitive


Sure, but then it's irrelevant to this... because it's already public, this isn't changing that status.


In the US you have the 4th amendment which proscribes unreasonable search and seizure but that is not a right to privacy. Thus, there is no such right articulated in the amendments themselves. Therefore, it's decided by case law (which is how all such things are decided in common law systems of jurisprudence). Chief amongst such cases is Griswold v. Connecticut, where the ruling found such a right is established by the constitution through a "reading between the lines" (penumbral reasoning).

Okay, this isn't conlaw 101, but my point is that privacy and right to privacy are technical terms, possibly unintuitive to you. Now someone's rights may have been abrogated here - I don't know - but that would again be argued in a court of law.


This is a very narrow and America-centric viewpoint. Privacy is a codified constitutional human right in many nations, including details and edge cases addressed by legislation and/or case law.


I am very confused by your response - how are we in disagreement in the least?


Thanks. I'll submit some briefs using this.



