NLP on Mueller Report (twitter.com)
Oh, that’s my thread. It’s not written for the HN crowd; my followers on Twitter are a very different demographic. It’s written to help non-technical folks start to consider the possibilities of the technology that exists in their lives now.

They’re just getting a carefully curated peek behind the curtain. The HN crowd knows what’s behind the curtain already.

A couple things: there’s a lot more interesting stuff going on than I’m explaining in my thread. As syllogism noticed, I’m using spaCy for the NLP part of the analysis (and displaCy was the best visual way to explain that to non-technical folks!). The approach I’m using to dereference relative dates could be useful for its own spaCy post on its NER, for example. Same with dereferencing last names to the full names they’re referring to. Those two things alone are immensely useful for journalists and data viz designers. Plus, I love the hell out of spaCy. I’ve done a ton of interesting stuff with it that I should write about, like discovering sock puppet accounts by comparing simple grammar signatures (relative ratios of parts of speech used in their text, along with other features like where adverbs and prepositions are used).
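For readers who want a concrete picture of the "grammar signature" idea, here is a minimal sketch. It is not the author's method. It assumes each account's text has already been run through a part-of-speech tagger (for example, by collecting spaCy's `token.pos_` values), and it covers only the relative-ratio part, not the positional features for adverbs and prepositions. The helper names `pos_signature` and `cosine_similarity` and the toy tag sequences are made up for illustration.

```python
from collections import Counter
from math import sqrt


def pos_signature(tags):
    """Map a sequence of POS tags to their relative frequencies."""
    counts = Counter(tags)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}


def cosine_similarity(sig_a, sig_b):
    """Cosine similarity between two sparse signatures stored as dicts."""
    dot = sum(value * sig_b.get(tag, 0.0) for tag, value in sig_a.items())
    norm_a = sqrt(sum(v * v for v in sig_a.values()))
    norm_b = sqrt(sum(v * v for v in sig_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy, hand-made tag sequences standing in for real tagger output.
account_a = ["PRON", "VERB", "ADV", "ADJ", "NOUN", "ADP", "NOUN"]
account_b = ["PRON", "VERB", "ADV", "ADJ", "NOUN", "ADP", "PROPN"]
account_c = ["NOUN", "NOUN", "NOUN", "VERB", "DET", "NOUN", "PUNCT"]

sim_ab = cosine_similarity(pos_signature(account_a), pos_signature(account_b))
sim_ac = cosine_similarity(pos_signature(account_a), pos_signature(account_c))
```

In this toy data, accounts A and B score closer to each other than A and C do. A high score on real text would only be a hint worth a closer look, not proof that two accounts share an author.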

Two more things: the flight data is the sleeping dragon of that thread. Really.

And I’ve abandoned the idea of “unredacting” with predictions. For some technical reasons, sure (the redactions, for example, actually displace or edit other text, so it’s impossible to know the correct box size)…but mostly for ethical concerns.

It’s both cool and weird to see my thread posted here! I hope my actual technical post gets a shot when I’m finished with it.

"discovering sock puppet accounts by comparing simple grammar signatures (relative ratios of parts of speech used in their text, along with other features like where adverbs and prepositions are used)."

Please write about that. I remember someone on HN claimed to be able to group accounts belonging to a single individual. When challenged, they offered to email the commenter their own alternate username, and the commenter later replied confirming they were correct. I can't find the thread now. It might have been minimaxir?

It was lettergram, not minimaxir:


Could you share a little more about the sockpuppet detection? That sounds very fascinating.

Do you specifically mean IRA sockpuppets, or Internet sockpuppeting in general? Were you looking at indications of sockpuppeting from native Russian speakers writing in English? Or identifying what appears to be automatically generated text? Or using text from known sockpuppet accounts to predict if other accounts may be sockpuppets? Some combination? How much stylometry was involved?

Of course that's not the main aim of the person in the thread. The aim is a timeline cross referencing different data sources.


I certainly agree that this wouldn’t tell us anything real about the Mueller report, although it might be a useful exercise to learn about the language models. What really rubs me the wrong way is academics or anyone else with power trying to tell me what is or isn’t “funny” or “interesting” or “fun.” Like, that’s for me to decide, not you!

Before you read a 25-tweet thread: he has not yet done any of the things he is talking about. This post is somewhat premature.

I have, actually. That thread was truncated and a new one started where I’ve shared some new things.

The whole thing is like a weekend project that’s being narrated for my Twitter followers. They’re (in general) a very different demographic than HN.

My technical post later on will be of greater interest to HN.

He'd better hurry, the unredacted version might leak by tomorrow. I'd be amazed if it doesn't appear pretty soon.

The work is more about running named entity recognition on the text and correlating the names with other sources of data than it is about deducing the redacted words (which is probably impossible for the most interesting words). For example if flight XX1234 is mentioned in the text, he might be able to deduce that the plane is owned by some Russian oligarch.

I already have a self-curated list of tail numbers of the private aircraft owned by oligarchs (and other international parties of interest), and a whole bunch of ADS-B data. Combine that with a timeline of events (and, for some, locations), and there are some…interesting correlations there.
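The kind of cross-referencing described here might look roughly like the sketch below. Everything in it is an assumption made for illustration: the tail numbers, owners, places, and dates are placeholders, the record layouts are invented, and `correlate` is a hypothetical helper, not the author's code. It pairs watchlisted aircraft sightings with timeline events at the same place within a small date window.

```python
from datetime import date, timedelta

# Hypothetical watchlist: tail number -> owner label (placeholder values).
WATCHLIST = {"N00001": "Party A", "N00002": "Party B"}

# Hypothetical ADS-B-derived sightings: (tail number, date, location).
sightings = [
    ("N00001", date(2000, 1, 9), "City X"),
    ("N00002", date(2000, 2, 1), "City Y"),
    ("N99999", date(2000, 1, 10), "City X"),  # not on the watchlist
]

# Hypothetical timeline events: (date, location, description).
events = [
    (date(2000, 1, 10), "City X", "Meeting described in a document"),
    (date(2000, 3, 1), "City Z", "Unrelated event"),
]


def correlate(sightings, events, watchlist, window_days=2):
    """Pair watchlisted sightings with same-place events within a date window."""
    window = timedelta(days=window_days)
    matches = []
    for tail, seen_on, place in sightings:
        if tail not in watchlist:
            continue
        for event_date, event_place, description in events:
            if place == event_place and abs(seen_on - event_date) <= window:
                matches.append((watchlist[tail], tail, seen_on, description))
    return matches


matches = correlate(sightings, events, WATCHLIST)
```

A match here means only that an aircraft and an event coincided in time and place. Whether that coincidence means anything is a question for a human to investigate.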

People seem to have filled in what OP is doing based on the headline/first three tweets only...

Yeah, that’s bumming me out a little. It’s the NER (which is getting a lot of my time on this one because spaCy is an amazing tool to extend) and the correlation with other data that’s the interesting part.

Especially those flights.

I wish I could delete the tweet about “unredacting” (or at least edit it to point to my decision later on) without breaking the thread. It was written in a moment of nerd glee, but the ethical considerations are more important to me.

Sorry, I picked up the wrong end of the stick and ran with it. I really shouldn't use the internet when tired.

Why would you have to hurry? Just save the unredacted version and use it when you're ready

> Section 508 requires your PDF to be accessible to users of assistive technology—like screen readers or Braille displays.

Have people filed official complaints about this? I think once the complaint is official (instead of tweets) the department is obligated to respond to it.

The site where the document is published (https://www.justice.gov/sco) shows the following message below the link to the PDF:

"The Department recognizes that these documents may not yet be in an accessible format. If you have a disability and the format of any material on the site interferes with your ability to access some information, please email the Department of Justice webmaster. To enable us to respond in a manner that will be of most help to you, please indicate the nature of the accessibility problem, your preferred format (electronic format (ASCII, etc.), standard print, large print, etc.), the web address of the requested material, and your full contact information, so we can reach you if questions arise while fulfilling your request."

I'm not sure what to make of the message. They don't say whether they'll send an ASCII version if you ask for it.

I have yet to get anything back from them.

Honestly, there is no good reason that born-digital content couldn’t get posted in digital/textual form rather than scanned pages.

The secure redactions created by other software would’ve been preserved, too. It’s inexcusable, IMO, to just scan the damn thing.

To be honest I bet they were scared the redacted sections would leak if they did it entirely digitally, due to ignorance of their own software. Not an excuse, but maybe an explanation.

is that how this works? throw up a disclaimer and i'm good?

i mean, it's fine they are in a tough spot and can't put up an accessible version right now. but that should maybe be answered in court in a year. and people will go to jail if it's a cover up, or not if it was an honest mistake.

The whole notion of 'i might stab you because i'm not good at knives' isn't really plausible to me. however, i'm totally willing to accept student drivers on the road, because everyone needs to learn _somehow_. Hopefully it's an honest mistake, or incompetence, and not anything sinister.

Unfortunately, this is a fairly common workaround for govt sites. Same goes for designing a site for multiple browsers: just throw up a caveat that "site is best viewed in IE circa 2001" to avoid actually fixing the problem.

This is the type of thing that I imagine the "data suckers" (Facebook, Google, Apple, Amazon) are able to do regularly with the complex data that they have and the advanced tools that they have at their disposal.

I can't see yet to what end that ability would be trained, but I consider that ability to be akin to the state's ability to see into my backyard from space: I know they can do it and it doesn't harm me immediately that they can, but there's something net negative about the power/ability imbalance that makes me feel uncomfortable. Of course nobody is going to use that awesome power on someone as inconsequential as me, but...

I wonder if the "data suckers" have any intention of creating a history of the world, in which no author imposes an interpretation of events. Besides remembering all the data, these companies have tools to parse the data and render it understandable, without imposing a meaning. Call it objective history. That would be valuable.

Surely the report isn't long enough to do this sort of thing? I mean...You can just read it, right?

(And I say this as the developer of the tools he'd likely be using...At least, he has a screenshot of our NER visualiser there.)

The unredacting part was originally born out of my experiments with a word-level LSTM approach trained on everything the SCO had released. More relevantly, that part was quickly abandoned. It’s all about extracting date-referenced narrative text, and the combination of the NER and the dependency parser has been amazing. Together, they’ve let me begin an extension that dereferences relative dates and last names as though they were pronouns.
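To illustrate what "dereferencing relative dates" could mean in practice, here is a small standard-library sketch. It is an assumption about the general shape of the problem, not the spaCy extension described above. It presumes an earlier step has already found the anchor date that a phrase refers back to (the hard part, where the NER and the dependency parse would come in), and it handles only a handful of English phrasings. The name `resolve_relative_date` and the phrase tables are hypothetical.

```python
import re
from datetime import date, timedelta

WORD_NUMBERS = {"a": 1, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}
UNIT_DAYS = {"day": 1, "week": 7}

# Fixed phrases mapped to an offset in days from the anchor date.
FIXED_PHRASES = {
    "the next day": 1, "the following day": 1,
    "the previous day": -1, "the day before": -1,
    "the following week": 7, "the previous week": -7,
    "that day": 0, "the same day": 0,
}

# Matches phrases such as "two days later" or "3 weeks earlier".
OFFSET_RE = re.compile(r"^(\w+) (day|week)s? (later|earlier)$")


def resolve_relative_date(phrase, anchor):
    """Return the absolute date for `phrase` relative to `anchor`, or None."""
    text = phrase.strip().lower()
    if text in FIXED_PHRASES:
        return anchor + timedelta(days=FIXED_PHRASES[text])
    match = OFFSET_RE.match(text)
    if not match:
        return None
    count_word, unit, direction = match.groups()
    if count_word.isdecimal():
        count = int(count_word)
    else:
        count = WORD_NUMBERS.get(count_word)
    if count is None:
        return None
    sign = 1 if direction == "later" else -1
    return anchor + timedelta(days=sign * count * UNIT_DAYS[unit])


anchor = date(2000, 1, 10)  # placeholder anchor date
next_day = resolve_relative_date("The next day", anchor)
two_later = resolve_relative_date("two days later", anchor)
weeks_back = resolve_relative_date("3 weeks earlier", anchor)
unknown = resolve_relative_date("sometime soon", anchor)
```

Returning `None` for unrecognized phrases keeps the timeline honest: an unresolved mention can be flagged for manual review instead of being assigned a guessed date.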

displaCy will make an appearance in the final “public” post I’m writing, as well as the tech post for the HN crowd. Thank you so much for your work on spaCy/displaCy!

All this NLP is great, but the important question is: what insights did you gain from it that you didn’t have before?

Honestly just a list of names, people, places and dates mentioned in the document could be a boon to investigative journalists.
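A list like that gets more useful when bare surnames are folded into the full names they refer to, as the author mentions elsewhere in the thread. The sketch below shows one naive way to do that. It assumes the input is a list of person mentions in document order (for instance, the text of PERSON entities from an NER pass), uses placeholder names, and resolves a surname to the most recently seen full name ending in it. `resolve_names` is a hypothetical helper, not the author's implementation.

```python
from collections import Counter


def resolve_names(mentions):
    """Replace bare surnames with the most recent full name sharing that surname."""
    last_full_name = {}  # surname -> most recently seen full name
    resolved = []
    for mention in mentions:
        parts = mention.split()
        if len(parts) > 1:
            # A multi-word mention is treated as a full name and remembered.
            last_full_name[parts[-1]] = mention
            resolved.append(mention)
        else:
            # A single word is treated as a surname; fall back to it if unseen.
            resolved.append(last_full_name.get(mention, mention))
    return resolved


# Placeholder mentions in document order.
mentions = ["Jane Doe", "John Roe", "Doe", "Roe", "Smith", "Alex Doe", "Doe"]
resolved = resolve_names(mentions)
inventory = Counter(resolved)
```

The "most recent full name" rule is a simplification. Two people who share a surname will be confused whenever the text switches between them without restating the full name, so a real pipeline would need more context than this.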

That was my prediction and hope. Judging by the DMs I’ve gotten, it was an accurate prediction. :)

I think he points out the ability to correlate non-report data (like airline flights) or to make it more obvious what is in the redacted sections. I am not sure how effective that will be, but overall NER counts and the relationships between people and a coherent timeline would be helpful.

The unredacting stuff was abandoned. It’s problematic even with training on other material from the SCO, and the ethics are fraught.

The correlation is where it’s at, no doubt.

It's a bit boring TBH. Basically OCR+NER. I guess some journalists have already done similar stuff using Google's NLP API.

Is there a central public place (wiki / github / ...) where people collaborate on annotating the Mueller Report?

That looks pretty cool!

Why not a Medium post?

The informal thread was really written for my followers on Twitter, who (mostly) comprise a very different demographic than HN. I use a different voice when speaking to mostly non-technical people. It’s the teacher/public speaker in me, I guess. :)

There will be a technical post later.

