
NLP on Mueller Report - rahimnathwani
https://twitter.com/spdustin/status/1119118085443559425
======
spdustin
Oh, that’s my thread. It’s not written for the HN crowd; my followers on
Twitter are a very different demographic. It’s written to help non-technical
folks start to consider the possibilities of the technology that exists in
their lives now.

They’re just getting a carefully curated peek behind the curtain. The HN crowd
knows what’s behind the curtain already.

A couple things: there’s a lot more interesting stuff going on than I’m
explaining in my thread. As syllogism noticed, I’m using spaCy for the NLP
part of the analysis (and displaCy was the best visual way to explain that to
non-technical folks!). The approach I’m using to dereference relative dates
could be useful for its own spaCy post on its NER, for example. Same with
dereferencing last names to the full names they’re referring to. Those two
things alone are immensely useful for journalists and data viz designers.
Plus, I love the hell out of spaCy. I’ve done a ton of interesting stuff with it
that I should write about—like discovering sock puppet accounts by comparing
simple grammar signatures (relative ratios of parts of speech used in their
text, along with other features like where adverbs and prepositions are used).
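
For readers wondering what a "grammar signature" comparison could look like, here is a minimal sketch, not the code behind the work described above. It assumes each account's text has already been reduced to a list of coarse part-of-speech tags (for example from spaCy's `token.pos_`); the tag subset, the sample accounts, and the choice of cosine similarity are all illustrative assumptions.

```python
from collections import Counter
from math import sqrt

# Coarse part-of-speech tags used as signature dimensions (an assumed subset).
POS_TAGS = ["NOUN", "VERB", "ADJ", "ADV", "ADP", "PRON", "DET"]

def signature(tags):
    """Return the relative frequency of each tag in POS_TAGS for one account."""
    counts = Counter(tags)
    total = sum(counts[t] for t in POS_TAGS) or 1  # avoid dividing by zero
    return [counts[t] / total for t in POS_TAGS]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical, hand-tagged input; real tags would come from a POS tagger.
account_a = ["PRON", "VERB", "ADV", "ADP", "DET", "NOUN", "ADV"]
account_b = ["PRON", "VERB", "ADV", "ADP", "DET", "NOUN", "ADV", "ADV"]
account_c = ["DET", "ADJ", "NOUN", "VERB", "DET", "ADJ", "NOUN"]

sim_ab = cosine(signature(account_a), signature(account_b))
sim_ac = cosine(signature(account_a), signature(account_c))
print(f"a vs b: {sim_ab:.2f}, a vs c: {sim_ac:.2f}")
```

A higher score only means the two tag distributions look alike. Positional features such as where adverbs and prepositions fall would need extra dimensions, and any real use would need far more text per account than this toy input.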

Two more things: the flight data is the sleeping dragon of that thread.
Really.

And I’ve abandoned the idea of “unredacting” with predictions. For some
technical reasons, sure (the redactions, for example, actually displace or
edit other text, so it’s impossible to know the correct box size)…but mostly
for ethical concerns.

It’s both cool and weird to see my thread posted here! I hope my actual
technical post gets a shot when I’m finished with it.

~~~
rahimnathwani
"discovering sock puppet accounts by comparing simple grammar signatures
(relative ratios of parts of speech used in their text, along with other
features like where adverbs and prepositions are used)."

Please write about that. I remember someone on HN claimed to be able to group
accounts belonging to a single individual. When challenged, they offered to
email the commenter their own alternate username, and the commenter later
replied confirming they were correct. I can't find the thread now. It
might have been minimaxir?

~~~
rahimnathwani
It was lettergram, not minimaxir:

[https://news.ycombinator.com/item?id=17944484](https://news.ycombinator.com/item?id=17944484)

------
phowon
And here is a tweet thread on why using NLP models to "fill in" the redacted
portions is a horrendously terrible idea.

[https://twitter.com/emilymbender/status/1119081131234611201](https://twitter.com/emilymbender/status/1119081131234611201)

~~~
Certhas
Of course that's not the main aim of the person in the thread. The aim is a
timeline cross-referencing different data sources.

~~~
spdustin
Bingo!

------
georgespencer
Before you read a 25 tweet thread: he has not yet done any of the things he is
talking about. This post is somewhat premature.

~~~
inflatableDodo
He'd better hurry, the unredacted version might leak by tomorrow. I'd be
amazed if it doesn't appear pretty soon.

~~~
soVeryTired
The work is more about running named entity recognition on the text and
correlating the names with other sources of data than it is about deducing the
redacted words (which is probably impossible for the most interesting words).
For example if flight XX1234 is mentioned in the text, he might be able to
deduce that the plane is owned by some Russian oligarch.
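
As a rough illustration of that correlation step, the sketch below pulls flight-like tokens out of text and looks them up in a second data source. Everything here is hypothetical: the `FLIGHT_REGISTRY` table stands in for whatever external flight data is available, its records are made up, and the regular expression is only an assumed shape for a flight number.

```python
import re

# Hypothetical lookup table standing in for an external flight data source.
FLIGHT_REGISTRY = {
    "XX1234": {"operator": "Example Air Charter", "origin": "AAA", "destination": "BBB"},
}

# Assumed flight-number shape: two uppercase letters followed by 1 to 4 digits.
FLIGHT_RE = re.compile(r"\b[A-Z]{2}\d{1,4}\b")

def correlate_flights(text, registry):
    """Return (flight number, registry record or None) for each flight-like token."""
    return [(number, registry.get(number)) for number in FLIGHT_RE.findall(text)]

sample = "He boarded flight XX1234 and later transferred to flight YY77."
matches = correlate_flights(sample, FLIGHT_REGISTRY)
for number, record in matches:
    print(number, record if record else "no match in registry")
```

Tokens with no registry entry are kept with `None` rather than dropped, so unmatched mentions can still be reviewed by hand.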

~~~
Certhas
People seem to have filled in what OP is doing based on the headline/first
three tweets only...

~~~
spdustin
Yeah, that’s bumming me out a little. It’s the NER (which is getting a lot of
my time on this one because spaCy is an amazing tool to extend) and the
correlation with other data that’s the interesting part.

Especially those flights.

I wish I could delete the tweet about “unredacting” (or at least edit it to
point to my decision later on) without breaking the thread. It was written in
a moment of nerd glee, but the ethical considerations are more important to
me.

~~~
inflatableDodo
Sorry, I picked up the wrong end of the stick and ran with it. I really
shouldn't use the internet when tired.

------
lake99
> Section 508 requires your PDF to be accessible to users of assistive
> technology—like screen readers or Braille displays.

Have people filed official complaints about this? I think once the complaint
is official (instead of tweets) the department is obligated to respond to it.

~~~
rahimnathwani
The site where the document is published
([https://www.justice.gov/sco](https://www.justice.gov/sco)) shows the
following message below the link to the PDF:

"The Department recognizes that these documents may not yet be in an
accessible format. If you have a disability and the format of any material on
the site interferes with your ability to access some information, please email
the Department of Justice webmaster. To enable us to respond in a manner that
will be of most help to you, please indicate the nature of the accessibility
problem, your preferred format (electronic format (ASCII, etc.), standard
print, large print, etc.), the web address of the requested material, and your
full contact information, so we can reach you if questions arise while
fulfilling your request."

I'm not sure what to make of the message. They don't say whether they'll send
an ASCII version if you ask for it.

~~~
spdustin
I have yet to get anything back from them.

Honestly, there is no good reason that born-digital content couldn’t get
posted in digital/textual form rather than scanned pages.

The secure redactions created by other software would’ve been preserved, too.
It’s inexcusable, IMO, to just scan the damn thing.

~~~
diminoten
To be honest I bet they were scared the redacted sections would leak if they
did it entirely digitally, due to ignorance of their own software. Not an
excuse, but maybe an explanation.

------
yay_cloud2
This is the type of thing that I imagine the "data suckers" (Facebook, Google,
Apple, Amazon) are able to do regularly with the complex data that they have
and the advanced tools that they have at their disposal.

I can't see yet to what end that ability would be trained, but I consider that
ability to be akin to the state's ability to see into my backyard from space:
I know they can do it and it doesn't harm me immediately that they can, but
there's something net negative about the power/ability imbalance that makes me
feel uncomfortable. Of course nobody is going to use that awesome power on
someone as inconsequential as me, but...

~~~
beautifulfreak
I wonder if the "data suckers" have any intention of creating a history of the
world, in which no author imposes an interpretation of events. Besides
remembering all the data, these companies have tools to parse the data and
render it understandable, without imposing a meaning. Call it objective
history. That would be valuable.

------
fit2rule
Angelastic ran her haiku detector on the Mueller report and found some amusing
results:

[https://angelastic.com/2019/04/20/unintentional-haiku-in-
the...](https://angelastic.com/2019/04/20/unintentional-haiku-in-the-mueller-
report/)

------
armantor
Meanwhile check this if you are interested: [https://www.axios.com/explore-a-
detailed-version-of-the-muel...](https://www.axios.com/explore-a-detailed-
version-of-the-mueller-report-5f7cab5b-9c53-46bc-abaa-bd6b7b3e6d66.html)

------
syllogism
Surely the report isn't long enough to do this sort of thing? I mean...You can
just read it, right?

(And I say this as the developer of the tools he'd likely be using...At least,
he has a screenshot of our NER visualiser there.)

~~~
spdustin
The unredacting part was originally born out of my experiments with a
word-level LSTM approach trained on everything the SCO had released. More
relevantly, that part was quickly abandoned. It’s all about extracting
date-referenced narrative text, and the combination of the NER and the
dependency parser has been amazing. Together, they’ve let me begin an
extension that dereferences relative dates and last names as though they were
pronouns.
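
To make the "as though they were pronouns" idea concrete, here is a minimal sketch of one way such dereferencing might work. It is not the extension described above. It assumes PERSON mentions arrive in document order (for instance from spaCy's `doc.ents`), that a bare surname refers to the most recent full name ending in it, and that relative phrases are resolved against the last absolute date seen. The function names, the phrase patterns, and the sample names are all made up for illustration.

```python
import re
from datetime import date, timedelta

def resolve_last_names(mentions):
    """Map each PERSON mention to the most recent full name sharing its last token."""
    last_seen = {}  # surname -> most recent multi-token name ending in it
    resolved = []
    for mention in mentions:
        parts = mention.split()
        if len(parts) > 1:
            last_seen[parts[-1]] = mention
            resolved.append(mention)
        else:
            # A bare surname with no earlier full name stays as it is.
            resolved.append(last_seen.get(mention, mention))
    return resolved

NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
RELATIVE_RE = re.compile(r"(\w+) days? (later|earlier)")

def resolve_relative_date(phrase, anchor):
    """Resolve phrases like 'two days later' against the last absolute date seen."""
    if phrase == "the next day":
        return anchor + timedelta(days=1)
    match = RELATIVE_RE.fullmatch(phrase)
    if match and match.group(1) in NUMBER_WORDS:
        sign = 1 if match.group(2) == "later" else -1
        return anchor + timedelta(days=sign * NUMBER_WORDS[match.group(1)])
    return None  # unrecognised phrase: leave unresolved rather than guess

names = resolve_last_names(["Jane Doe", "John Roe", "Doe", "Roe", "Poe"])
resolved = resolve_relative_date("two days later", date(2016, 6, 7))
print(names, resolved)
```

Returning `None` for phrases the sketch does not recognise keeps a wrong date out of a timeline. A real version would lean on the dependency parse to find which absolute date a relative phrase actually attaches to.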

displaCy will make an appearance in the final “public” post I’m writing, as
well as the tech post for the HN crowd. Thank you so much for your work on
spaCy/displaCy!

------
sytelus
All this NLP is great, but the important question is: what insights did you
gain from it that you didn’t have before?

~~~
soVeryTired
Honestly just a list of names, people, places and dates mentioned in the
document could be a boon to investigative journalists.

~~~
spdustin
That was my prediction and hope. Judging by the DMs I’ve gotten, it was an
accurate prediction. :)

------
rayrrr
Here's some NLP text summarization of the Mueller Report in action (by yours
truly):
[https://news.ycombinator.com/item?id=19815506](https://news.ycombinator.com/item?id=19815506)

------
1wd
Is there a central public place (wiki / github / ...) where people collaborate
on annotating the Mueller Report?

------
dang
Url changed from
[https://threadreaderapp.com/thread/1119118085443559425.html](https://threadreaderapp.com/thread/1119118085443559425.html),
which points to this.

------
nwrk
That looks pretty cool!

------
daRealDodo
Why not a Medium post?

~~~
spdustin
The informal thread was really written for my followers on Twitter, who
(mostly) comprise a very different demographic than HN. I use a different
voice when speaking to mostly non-technical people. It’s the teacher/public
speaker in me, I guess. :)

There will be a technical post later.

