They’re just getting a carefully curated peek behind the curtain. The HN crowd knows what’s behind the curtain already.
A couple things: there’s a lot more interesting stuff going on than I’m explaining in my thread. As syllogism noticed, I’m using spaCy for the NLP part of the analysis (and displaCy was the best visual way to explain that to non-technical folks!). The approach I’m using to dereference relative dates could be useful for its own spaCy post on its NER, for example. Same with dereferencing last names to the full names they’re referring to. Those two things alone are immensely useful for journalists and data viz designers. Plus, I love the hell out of spaCy. I’ve done a ton of interesting stuff with it that I should write about, like discovering sock puppet accounts by comparing simple grammar signatures (relative ratios of parts of speech used in their text, along with other features like where adverbs and prepositions are used).
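For anyone wondering what a "grammar signature" comparison might look like, here is a minimal sketch. It is not the code from the thread. It assumes the text has already been tagged (with spaCy, the tags would come from `token.pos_`), and the tag subset, the function names, and the choice of cosine similarity are all illustrative assumptions. It also covers only the ratio part, not positional features such as where adverbs and prepositions appear.

```python
from collections import Counter
from math import sqrt

# Universal POS tags tracked in the signature (an illustrative subset).
TAGS = ("NOUN", "VERB", "ADJ", "ADV", "ADP", "PRON", "DET")


def grammar_signature(pos_tags):
    """Return the share of each tracked tag among all tokens."""
    counts = Counter(pos_tags)
    total = len(pos_tags) or 1  # avoid dividing by zero on empty input
    return [counts[tag] / total for tag in TAGS]


def cosine_similarity(a, b):
    """Cosine of the angle between two signatures (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical pre-tagged text from three accounts. With spaCy these lists
# would be built as [token.pos_ for token in nlp(text)].
account_a = ["PRON", "VERB", "ADV", "ADP", "DET", "NOUN"]
account_b = account_a * 2  # same tag proportions, more text
account_c = ["NOUN", "NOUN", "ADJ", "NOUN", "VERB", "NOUN"]

sim_ab = cosine_similarity(grammar_signature(account_a), grammar_signature(account_b))
sim_ac = cosine_similarity(grammar_signature(account_a), grammar_signature(account_c))
print(f"a vs b: {sim_ab:.3f}  a vs c: {sim_ac:.3f}")
```

Because the signature uses ratios rather than raw counts, accounts with different amounts of text stay comparable. A high similarity is only a hint worth a closer look, not proof that two accounts share an author.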
Two more things: the flight data is the sleeping dragon of that thread. Really.
And I’ve abandoned the idea of “unredacting” with predictions. For some technical reasons, sure (the redactions, for example, actually displace or edit other text, so it’s impossible to know the correct box size)…but mostly for ethical concerns.
It’s both cool and weird to see my thread posted here! I hope my actual technical post gets a shot when I’m finished with it.
Please write about that. I remember someone on HN claimed to be able to group accounts belonging to a single individual. When challenged, they offered to email the commenter their own alternate username, and the commenter later replied confirming they were correct. I can't find the thread now. It might have been minimaxir?
Do you specifically mean IRA sockpuppets, or Internet sockpuppeting in general? Were you looking at indications of sockpuppeting from native Russian speakers writing in English? Or identifying what appears to be automatically generated text? Or using text from known sockpuppet accounts to predict if other accounts may be sockpuppets? Some combination? How much stylometry was involved?
The whole thing is like a weekend project that’s being narrated for my Twitter followers. They’re (in general) a very different demographic than HN.
My technical post later on will be of greater interest to HN.
Especially those flights.
I wish I could delete the tweet about “unredacting” (or at least edit it to point to my decision later on) without breaking the thread. It was written in a moment of nerd glee, but the ethical considerations are more important to me.
Have people filed official complaints about this? I think once the complaint is official (instead of tweets) the department is obligated to respond to it.
"The Department recognizes that these documents may not yet be in an accessible format. If you have a disability and the format of any material on the site interferes with your ability to access some information, please email the Department of Justice webmaster. To enable us to respond in a manner that will be of most help to you, please indicate the nature of the accessibility problem, your preferred format (electronic format (ASCII, etc.), standard print, large print, etc.), the web address of the requested material, and your full contact information, so we can reach you if questions arise while fulfilling your request."
I'm not sure what to make of the message. They don't say whether they'll send an ASCII version if you ask for it.
Honestly, there is no good reason that born-digital content couldn’t get posted in digital/textual form rather than scanned pages.
The secure redactions created by other software would’ve been preserved, too. It’s inexcusable, IMO, to just scan the damn thing.
I mean, it's fine that they are in a tough spot and can't put up an accessible version right now. But that should maybe be answered in court in a year, and people will go to jail if it's a cover-up, or not if it was an honest mistake.
The whole notion of 'I might stab you because I'm not good at knives' isn't really plausible to me. However, I'm totally willing to accept student drivers on the road, because everyone needs to learn _somehow_. Hopefully it's an honest mistake, or incompetence, and not anything sinister.
I can't see yet to what end that ability would be trained, but I consider that ability to be akin to the state's ability to see into my backyard from space: I know they can do it and it doesn't harm me immediately that they can, but there's something net negative about the power/ability imbalance that makes me feel uncomfortable. Of course nobody is going to use that awesome power on someone as inconsequential as me, but...
(And I say this as the developer of the tools he'd likely be using... At least, he has a screenshot of our NER visualiser there.)
displaCy will make an appearance in the final “public” post I’m writing, as well as the tech post for the HN crowd. Thank you so much for your work on spaCy/displaCy!
The correlation is where it’s at, no doubt.
There will be a technical post later.