Good luck doing that in the United States. Pretty much all of that kind of data, while supposedly available under the FOIA, is paywalled by the government or third parties. But it would be interesting to see an AI model.
Harvard Law's Library Innovation Lab has the Caselaw Access Project, a complete collection of precedential US caselaw with structured metadata. It's at https://case.law. It's readable online (including the original PDFs for most), accessible via a REST API as fully structured documents, and available through bulk downloads. There are OCR mistakes here and there, but the accuracy is over 99%, even with weird things like the "long s" that looks like an f and was common before the 20th century. While updates are too slow to replace the commercial tools, it's perfect for uses like this... Rather, it will be in a little over 3 months. In Feb of 2018, it was released under an agreement with a funder to limit access to 500 cases per day per user for 6 years, except for a few jurisdictions available now, so the entire corpus will be completely open then.
They scanned, OCR'd, and applied metadata to 40k volumes, and (digitally) redacted by hand all commercial material (e.g. headnotes, key citations) in all in-copyright volumes, so what's left is entirely in the public domain.
Disclosure: I worked on that project for several years.
> Please let me know how you find using the data and if you'd like to see any additions or changes!
I only just started yesterday but so far so good!
Having it in Parquet files on Git LFS makes a huge difference. It only took a few lines to add the entire dataset to our CI/CD cache, which is an improvement over the ingestion scripts we normally have to write, with change detection and all that. It took less than an hour to start running the cases through our pipeline. I wish all of the GovInfo bulk data were available this way!
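For reference, the kind of change detection an ingestion script usually needs might look roughly like the sketch below. It is only an illustration and not the commenter's actual pipeline: the `*.parquet` glob, the JSON manifest file, and the function names `file_digest` and `changed_files` are all assumptions made up for this example.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def changed_files(data_dir: Path, manifest_path: Path) -> list:
    """Return relative paths of new or modified files, then update the manifest.

    The manifest is a hypothetical JSON file mapping relative path -> digest
    from the previous run. Deleted files are simply dropped from it.
    """
    old = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    new = {
        p.relative_to(data_dir).as_posix(): file_digest(p)
        for p in sorted(data_dir.rglob("*.parquet"))
    }
    manifest_path.write_text(json.dumps(new, indent=2))
    return [name for name, digest in new.items() if old.get(name) != digest]
```

Calling `changed_files(Path("data"), Path("manifest.json"))` on each run would return only the files that need re-ingesting. A CI/CD cache keyed on the repository contents makes this bookkeeping unnecessary, which is presumably the improvement described above.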
To be clear, they are all in the public domain— Fastcase updates included. All of the proprietary info was redacted by hand and the opinions themselves are not copyrightable. The throttling is a contractual obligation to a project partner that limits Harvard's distribution of the cases until Feb of 2024, but that's it. There are also exceptions— cases where the publication is no longer in copyright, and jurisdictions that already publish their opinions online... There are 3 or 4. Those are accessible without throttling through the API and through bulk downloads right now.
This should have more up-to-date and accurate information than I do:
https://case.law/about
From what I've been told by people who know way more about it than I do, it is complete. One thing to consider is that official published precedential caselaw is from the appellate level up. From what I understand, lower court cases aren't published, though I believe they can be accessed through PACER... But I have no real legal research expertise. The first part of the project entailed a lengthy process involving several decades-experienced legal librarians, lawyers, and archivists mapping out exactly what reporters exist and which ones were "official" (considered authoritative by the courts) and when. Apparently nobody had done it before; well, nobody that made their data available, anyway. HLSL is a library of last resort for law and all but a tiny fraction of the books we scanned are in their collection.
In the about page there's more detailed information about the scope and process.
This one isn't even paywalled. Won't give you access even if you have money.
> Applications for access to the Cambridge Law Corpus (CLC) can only be made by researchers who are employed full-time by a recognised university or other research institution. The applicant must hold a permanent position at the level of Assistant Professor (or higher) or equivalent.
So corporations' access will be quickly sorted. Maybe a 1 million dollar research grant from a GAFAM corporation to a promising and enterprising Computer Science Department...
"We include an extensive legal and ethical discussion to address the potentially sensitive nature of this material. As a consequence, the corpus will only be released for research purposes under certain restrictions."
So no, you will never see this unless you are a researcher. Academic gatekeeping strikes again.
It's amazing how one-sided the public nature of prosecutions is: the government is entitled to everything that is advantageous to it and harmful to the defendant in the name of the public, but that can't quite be extended to ensuring the public has the necessary tools to understand, interpret, and successfully operate within the domain of the law, and sometimes (God forbid) "against" it (in an adversarial, not contravening, sense) by playing within its rules to their own benefit or advantage, as they should arguably be fully entitled and empowered to.
Assuming it's consistent with fundamental justice and precedent (which shouldn't really require an advanced legal education to competently analyse and resynthesize into a case that integrates the facts of the case in question), we should value the ability of the citizenry and all litigants in general to quickly and easily present an informed, meritorious case, and not rely on obfuscation or access issues rooted in technology and deficiencies of any given education system to unduly limit participants' capabilities.
Once you feed an LLM all case law, answers to questions like "...so how can I kill my family, collect life insurance on them and get away with it" become trivial for the plebes to access.
As cool as tech advances are, you really don't want everyone unlocking the secrets of estate law or nuclear fusion. Some gatekeeping is necessary.
I suspect the LLM answer would be "There is no loophole that would allow you to kill your family for the life insurance money, but if you're crafty enough, never tell a soul, and keep the secret to your grave, then there's a half chance the case goes cold until your old age, wherein your fifth cousin once removed takes a DNA test that incriminates you, which is good enough."
Sometimes unlocking the secrets of something isn't enough - you need the underlying capital or power to actually pull it off. Normal individuals with normal incomes can't reliably build fusion reactors or pull off insurance fraud schemes no matter how much compute or intelligence they have. If they could, we wouldn't be living in a statist society, we'd be living in 2b2t.
Mostly this is correct, but it happens. There are always outliers and edge cases, and it's good for the system to have them so that they can be mitigated by iteratively improving the laws based on how they "won" and "got away with it". The noose is always tightening.
All court records, filings, judgements, laws, and reviews are public records unless sealed by the courts. It's a matter of whether they have been digitized or not.
How easy it is to access data is a dimension of privacy, too. When Facebook introduced the Timeline, it made it easy to jump to a year in the past. Before, you had to laboriously scroll back through all recent posts. Making my old embarrassing high school photos easy to look up affects me!
It’s one thing for someone to have to dig up records through obscure legal websites. It’s another for it to show up on the front page of your Google search listing.
In the US you have the 4th amendment which proscribes unreasonable search and seizure but that is not a right to privacy. Thus, there is no such right articulated in the amendments themselves. Therefore, it's decided by case law (which is how all such things are decided in common law systems of jurisprudence). Chief amongst such cases is Griswold v. Connecticut, where the ruling found such a right is established by the constitution through a "reading between the lines" (penumbral reasoning).
Okay, this isn't con law 101, but my point is that privacy and the right to privacy are technical terms, possibly unintuitive to you. Now, someone's rights may have been abrogated here - I don't know - but that would again be argued in a court of law.
This is a very narrow and America-centric viewpoint. Privacy is a codified constitutional human right in many nations, including details and edge cases addressed by legislation and/or case law.