
Turning leaked documents into exploitable data - vinnyglennon
https://linkurio.us/panama-papers-how-linkurious-enables-icij-to-investigate-the-massive-mossack-fonseca-leaks/
======
danso
> _The whole Panama Papers data will be released in early May with Linkurious.
> Follow the stories!_

This is incorrect, according to other ICIJ statements. Not only because of the
sheer logistics of distributing 2+TB of data, but also out of concern that the
file trove contains a lot of information about private individuals (never mind
personal data on everyone, public figure or not), and because the file trove
might contain metadata that would inadvertently expose the whistleblower:

[http://www.wired.com/2016/04/reporters-pulled-off-panama-
pap...](http://www.wired.com/2016/04/reporters-pulled-off-panama-papers-
biggest-leak-whistleblower-history/)

> _Ryle says that the media organizations have no plans to release the full
> dataset, WikiLeaks-style, which he argues would expose the sensitive
> information of innocent private individuals along with the public figures on
> which the group’s reporting has focused. “We’re not WikiLeaks. We’re trying
> to show that journalism can be done responsibly,” Ryle says. He says he
> advised the reporters from all the participating media outlets to “go crazy,
> but tell us what’s in the public interest for your country.”_

In past investigations, ICIJ has not released the raw file troves that have
been leaked to them. For example, in their Offshore Leaks project, they
released a 100MB cleaned, machine-readable dataset, but not the actual raw
leak itself:

[https://www.icij.org/blog/2013/04/why-we-will-not-turn-
over-...](https://www.icij.org/blog/2013/04/why-we-will-not-turn-over-
offshore-files-government-agencies)

Likely, they'll produce the extracted machine-readable data they've been using
to do their in-house querying. That's still substantial, of course.

------
DyslexicAtheist
Great to see them getting business ... but why am I reading a PR statement
from some data company on HN? It would be more interesting to see how the same
can be achieved using Open Source technologies. The article mentions Solr and
Tika to extract the metadata[1] but that's about it.

[1] I'm pretty sure that, just for the purpose of extracting the metadata, the
same can be done with more lightweight tools.
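
As a rough illustration of what "more lightweight" could mean, here is a
minimal sketch that pulls basic metadata out of a single email using only the
Python standard library. It assumes the document is a plain RFC 5322 message;
the `extract_metadata` helper and the sample message are made up for this
example and are not what ICIJ or Linkurious actually run. Formats like PDF or
Word files would still need something more capable.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def extract_metadata(raw_message: str) -> dict:
    """Return a few header fields of an email as plain metadata.

    Hypothetical helper for illustration; handles only plain RFC 5322 text.
    """
    msg = message_from_string(raw_message)
    date_header = msg.get("Date")
    return {
        "from": msg.get("From"),
        "to": msg.get("To"),
        "subject": msg.get("Subject"),
        # Normalize the date to ISO 8601 when the header is present.
        "date": parsedate_to_datetime(date_header).isoformat()
        if date_header
        else None,
    }


# Made-up sample message, for illustration only.
SAMPLE = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Quarterly filing\n"
    "Date: Mon, 04 Apr 2016 09:30:00 +0000\n"
    "\n"
    "Body text.\n"
)

metadata = extract_metadata(SAMPLE)
print(metadata)
```

The resulting dictionary could then be written out as JSON or loaded into
whatever index or graph store is used for querying.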

