
Ask HN: Information extraction software? - mjfern
I'm looking for information extraction software that I can feed in historical legal agreements and it will report:

1\. Changes in the text between the documents

2\. Changes in other attributes of the documents (e.g., word count)

3\. % change over time in the text and attributes (e.g., text in the 1986 version of the doc is 56% different than the text in the 1985 version of the doc)

Can anyone please point me to software that might fit this particular need?

Thanks in advance!
Michael
======
tgflynn
I don't know of any software designed specifically for doing this with legal
documents, but you could probably do most of it quite easily with some simple
Unix tools and a little scripting. I'm guessing the documents aren't plain
text, so the first step would be to extract the text. For example, if they're
PDFs you could use pdftotext.

Then for:

1\. diff, a diff viewer like xxdiff (the name may have changed somewhat
recently), or git

2\. wc

3\. diff with some scripts to automatically process the documents, count the
number of words in the documents, and write to a csv file
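
For what it's worth, here is a rough sketch of steps 1-3 in Python instead of shell, assuming the text has already been extracted (e.g. with pdftotext). The function names and CSV columns are made up for illustration, and "percent changed" here is just 1 minus difflib's word-level similarity ratio, which is only one of several reasonable ways to define it.

```python
import csv
import difflib


def word_count(text):
    """Number of whitespace-separated words, roughly what `wc -w` reports."""
    return len(text.split())


def percent_changed(old, new):
    """Word-level difference between two texts, as a percentage.

    Assumption: "percent different" is defined as 1 minus the
    SequenceMatcher similarity ratio. Other definitions are possible.
    """
    matcher = difflib.SequenceMatcher(None, old.split(), new.split(),
                                      autojunk=False)
    return round((1.0 - matcher.ratio()) * 100, 1)


def compare_versions(versions):
    """Compare each version with the next one.

    `versions` is a list of (label, text) pairs in chronological order.
    Returns one summary row (a dict) per consecutive pair.
    """
    rows = []
    for (old_label, old), (new_label, new) in zip(versions, versions[1:]):
        # Step 1: the actual textual changes, as a unified diff.
        diff = list(difflib.unified_diff(
            old.splitlines(), new.splitlines(),
            fromfile=old_label, tofile=new_label, lineterm=""))
        rows.append({
            "from": old_label,
            "to": new_label,
            "words_from": word_count(old),               # step 2
            "words_to": word_count(new),
            "pct_text_changed": percent_changed(old, new),  # step 3
            "diff_lines": len(diff),
        })
    return rows


def write_csv(rows, out):
    """Write the summary rows to an open text file object."""
    if not rows:
        return
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
```

You would call `compare_versions([("1985", text_1985), ("1986", text_1986)])` and pass the result to `write_csv` along with a file opened with `newline=""`. Legal text often gets re-wrapped between versions, which is why the percentage is computed over words and not lines.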

------
salaroglio
You can evaluate the project [http://eucases.eu/](http://eucases.eu/), and you
can evaluate the Akoma Ntoso standard. Here you can find an editor:
[https://legixinfo.wordpress.com/2015/07/02/coming-soon-a-new...](https://legixinfo.wordpress.com/2015/07/02/coming-soon-a-new-web-based-editor-for-akoma-ntoso/)

------
BjoernKW
Apache UIMA ( [https://uima.apache.org/](https://uima.apache.org/) ) and GATE
( [https://gate.ac.uk/ie/](https://gate.ac.uk/ie/) ) come to mind.

Those are not ready-made software products, though, but rather frameworks that
allow you to implement IE algorithms. While not exactly trivial, implementing
something like what you're describing is definitely possible with GATE.

------
dreamdu5t
I can build the software for you if you need.

