
Structured vs. Unstructured Data for legal and business professionals - lawtomated
https://lawtomated.com/structured-data-vs-unstructured-data-what-are-they-and-why-care/
======
tannhaeuser
The thing the author is looking for is called _semistructured data_, as in
SGML/XML, where data can live in containers/files users can relate to, can be
rendered to make sense to a user (y'know, like in a browser), can evolve with
requirements, and can still be extracted with formal precision. And it has been
quite heavily used in legal as well since about the 1980s.
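
To make "extracted with formal precision" concrete, here is a minimal sketch
in Python using only the standard library. The element and attribute names
(`contract`, `party`, `clause`, `jurisdiction`) are invented for illustration
and are not taken from any real legal markup standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: these tag and attribute names are assumptions made up
# for this example, not an actual legal XML/SGML schema.
CONTRACT_XML = """
<contract id="example-001">
  <party role="seller">Acme Ltd</party>
  <party role="buyer">Widget Inc</party>
  <clause type="governing-law">This agreement is governed by the laws of <jurisdiction>England and Wales</jurisdiction>.</clause>
</contract>
"""


def extract_terms(xml_text):
    """Pull machine-readable facts out of a document that still reads as prose."""
    root = ET.fromstring(xml_text.strip())
    # Each party is addressed by its role attribute, not by its position in the text.
    parties = {p.get("role"): p.text for p in root.findall("party")}
    clause = root.find("clause[@type='governing-law']")
    jurisdiction = clause.findtext("jurisdiction")
    # itertext() flattens the mixed content back into the sentence a reader sees.
    prose = "".join(clause.itertext())
    return {"parties": parties, "jurisdiction": jurisdiction, "prose": prose}


terms = extract_terms(CONTRACT_XML)
print(terms["jurisdiction"])
print(terms["prose"])
```

The same clause serves both audiences: a renderer shows the full sentence,
while a query gets the jurisdiction without any text mining.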

~~~
lawtomated
True, maybe we should have clarified that in our article.

The point we were trying to make is different, however: in reality most legal
data in its final authoritative form, at least for contracts, is stored as a
scanned PDF of the signed agreement, i.e. an image, not the original Word doc
containing semistructured data.

Granted, if lawyers didn't scan docs as images and hold only those scanned
images to be the authoritative data on a particular matter things might be
different.

Lots of projects are in the works at law firms and in-house legal teams to try
and maintain contracts as structured, or at least semistructured, data from
cradle to grave, but old habits of scanning contracts still persist.

Not sure if that adds clarity. Either way, it would be good to know so we can
improve our content :)

~~~
tannhaeuser
I see. I just wanted to point out that semistructured documents have a long
history in law in particular, with some of the oldest text databases in use.
AFAIK, law firms were holding on to WordPerfect for a long time (and many are
using it still) even when MS Word became the de-facto mainstream format, and
WordPerfect has a rich history of structured, non-WYSIWYG editing, and could
be converted to SGML as early as 1992. So I guess if the problem today is
overuse of binary-only transport formats such as PDF, a discussion about
representation of text in law should start with a look back, with a
perspective on what's been lost (not to mention that a paged media format is
probably not the way forward in 2019 or even 1999 for that matter).

~~~
lawtomated
Definitely agree. In legal we've ended up with this odd process by which we
start with semistructured data that could be, but isn't, stored as structured
data, e.g. the summary deal terms in a corporate context. That is usually a
Word table of key value pairs, e.g. parties, financial values, percentages,
key clause types or even the exact text to be included.
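
As a rough sketch of what "stored as structured data" could mean for such a
table, the Python below turns key-value rows into a typed record. The field
names, row labels and sample values are all hypothetical, chosen only to
mirror the kinds of terms listed above.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class DealTerms:
    """Hypothetical record for a term sheet; fields are illustrative only."""
    parties: tuple
    purchase_price: Decimal
    equity_percentage: Decimal
    governing_law: str


def parse_term_sheet(rows):
    """Convert label -> text rows (as typed into a Word table) into typed fields."""
    return DealTerms(
        parties=tuple(name.strip() for name in rows["Parties"].split(";")),
        # Strip thousands separators so the amount becomes an exact number.
        purchase_price=Decimal(rows["Purchase price"].replace(",", "")),
        equity_percentage=Decimal(rows["Equity percentage"].rstrip("%")),
        governing_law=rows["Governing law"].strip(),
    )


# Made-up sample rows, standing in for the cells of a Word table.
rows = {
    "Parties": "Acme Ltd; Widget Inc",
    "Purchase price": "1,250,000",
    "Equity percentage": "12.5%",
    "Governing law": "England and Wales",
}
terms = parse_term_sheet(rows)
print(terms)
```

Once the terms are typed like this, they can be validated, compared across
deals, or merged into a template instead of being retyped into the long form.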

That is then turned into a long-form MS Word contract, negotiated, and then
signed and physically scanned back into a machine as a PDF (vs. PDFing the
native Word doc and preserving the text layer).

Very avoidable as you note, and actually solvable via a look backwards and/or
by redesigning the technical paradigm for representing a contract's data model
from cradle to grave.

