The big problem in the NGO space is obviously finding engineers: we've got some budget for this stuff, but getting people to join us, work at a lower salary, and then code their hearts out (because teams are tiny) is hard.
On the fun side, you get to see your code applied to fighting real-world problems every day :)
The value is in the dataset. Most of what's on top is handmade by fresh grads.
Spend your effort getting access to the firehose, and the rest will follow.
If you truly want to provide an open source alternative, then curating some Jupyter notebooks with the right libraries and integrations is all it takes. A properly curated set of R libraries with a nice interface would also do the trick.
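To make that concrete, here's a minimal sketch of the kind of notebook cell I have in mind, using pandas and networkx; the CSV file and column names are made up for illustration:

```python
# Illustrative sketch only: one "curated notebook" cell built on
# pandas + networkx. The CSV file and its columns are hypothetical.
import pandas as pd
import networkx as nx

# Load a flat export of entity relationships, e.g. officer -> company links.
edges = pd.read_csv("officers_to_companies.csv")  # hypothetical file

# Build a graph so the question "who is connected to whom?" becomes trivial.
g = nx.from_pandas_edgelist(edges, source="officer", target="company")

# Simple investigative questions become one-liners:
central = sorted(nx.degree_centrality(g).items(), key=lambda kv: -kv[1])[:10]
print(central)                                # most-connected entities
print(len(list(nx.connected_components(g))))  # clusters of related entities
```

Nothing here is exotic; the value of "curation" is just picking a small, consistent stack like this and documenting it well.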
Demo video: https://www.youtube.com/watch?v=g0O8UNM0B7Y
Why can't you say who they are?
The ICIJ used a combination of Apache Tika, Nuix, Tesseract and a bunch of other components when loading data into Neo4j before interrogating it within Linkurious.
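For a rough picture of what that kind of loading pipeline can look like (a hedged sketch, not the ICIJ's actual code), here's a Python version assuming the tika-python, pytesseract/Pillow, and neo4j packages; the Cypher schema, credentials, and file paths are purely illustrative:

```python
# Sketch of a document -> text -> graph loading pipeline.
from tika import parser            # wraps an Apache Tika server
from PIL import Image
import pytesseract                 # Tesseract OCR bindings
from neo4j import GraphDatabase

def extract_text(path):
    """Pull text out of a document, falling back to OCR for scanned images."""
    parsed = parser.from_file(path)
    text = (parsed.get("content") or "").strip()
    if not text and path.lower().endswith((".png", ".jpg", ".tif")):
        text = pytesseract.image_to_string(Image.open(path))
    return text

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

def load_document(path):
    text = extract_text(path)
    with driver.session() as session:
        # Store each document as a node; entity extraction and linking
        # would hang further relationships off these nodes.
        session.run(
            "MERGE (d:Document {path: $path}) SET d.text = $text",
            path=path, text=text,
        )

load_document("leak/box-42/contract-0001.pdf")
```

The graph then gets explored in a front end like Linkurious; the hard, unglamorous part is everything before that step.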
It's also worth noting that the Panama data set is riddled with data quality issues (even if that's understandable given the size of the team compared to the scale of the problem).
There's really no magic going on in such products — the only part that's really general purpose is the GUI used to view the end results. The rest of it is just a bunch of lowly peons doing a ton of gruntwork to hammer the data into a form that said GUI will accept.
That last name change was after I'd left. I didn't even make it a full year at the company before my conscience got the better of me and I quit.
An "analytics" job at Detica was really just a half-step above data entry. It was mind-numbing and soul-eating. There was a very high turnover rate because even new graduates were overqualified for the position, and almost everyone was miserable.
There is a minimum physical hardware cost in the multi-million dollar range just to get started.
Not to mention the labor cost to operate and architect said hardware. Anything at the petabyte scale needs a minimum skeleton crew of maybe 3-4 people (if they're top talent; more if they aren't) split between sysadmin, architect, and data engineering roles to keep things production-worthy. The industry is still light on people with those skills, so on top of the bare hardware you're probably talking about $600-800K in annual labor cost as a starting point (roughly $200K fully loaded per head).
And that's all before you get much usage; the minute utilization climbs, that skeleton crew won't cut it any more. The sysadmin will need to turn into an on-call ops team, and data engineering will need specialists for onboarding vs. analytics vs. the access layer.
If you are doing anything controversial, don't forget the importance of a solid security organization, which is highly challenging in the distributed computing space.
These are just the technical hurdles. Depending on the data you'd be bringing in and the locality, there are quite possibly legal barriers as well.
So again, it's not so much that this isn't possible, more that there hasn't been someone willing to endow the kind of funding required to scale something of this nature solely for the benefit of the public good.
Jean from Linkurious
If you're a journalist or educator and want to check it out, shoot me an email, I'm andy@
I wonder if they'd release under the GPL3 or BSD licence. Maybe they'll let me fork them on GitHub.
As it gets a bit more commonplace, or if tools to do some of the heavy lifting get out.
Keep an eye on programs like the one at CU Boulder; it should produce some interesting research that moves this along outside the halls of industry.
The easy part is really the interface and information displays; the harder part (and where they make the lion's share of their money) is in data connection services and software customization.
Building an open source Palantir tool wouldn't be all that difficult; in fact, a great many organizations already build some subset of those tools using readily available open source components, with tighter coupling to their business needs. But these efforts are fractured and disorganized, and there isn't a great centralized open source tool that really replicates their system.
Should there be? I think the general problem of pulling together lots of information into a common pool, then being able to annotate that data and map it to a semantic model, is useful, and it generalizes well. But at the same time, many, many sources of data are already available in nicely, semantically organized forms with simpler interfaces (think IMDB, Pouet, Wikipedia, etc.), so it's not quite clear that their approach offers enough payoff over those easier methods.
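To make the "common pool plus semantic model" idea concrete, here's a toy sketch using rdflib; the namespace, classes, and sample records are invented for illustration:

```python
# Toy illustration of pooling flat records and mapping them to a semantic model.
from rdflib import Graph, Namespace, Literal, RDF

EX = Namespace("http://example.org/investigation/")
g = Graph()

# Records pooled from different sources (a registry dump, a spreadsheet, ...).
records = [
    {"person": "alice", "role": "director",    "company": "acme_holdings"},
    {"person": "bob",   "role": "shareholder", "company": "acme_holdings"},
]

# Map each flat record onto a small ontology: Person -[role]-> Company.
for r in records:
    person, company = EX[r["person"]], EX[r["company"]]
    g.add((person, RDF.type, EX.Person))
    g.add((company, RDF.type, EX.Company))
    g.add((person, EX[r["role"]], company))
    # Annotations (provenance, analyst notes) attach the same way:
    g.add((person, EX.sourcedFrom, Literal("registry dump 2016-04")))

print(g.serialize(format="turtle"))
```

Once the data is in a shape like this, querying and annotating it is generic; the expensive part is, again, the per-source mapping work.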
Lumify (lumify.io), developed by Altamira
Visual Understanding Environment (http://vue.tufts.edu/) from Tufts
Lumify is for networked data analysis and data annotation
VUE is more for presentation but has very basic network and semantic analysis tools, plus ontology mapping capabilities.
and last but not least - TimeFlow is old but useful for mapping temporal sequences quickly and easily. See
Do you know how it looks in Palantir's case? I've never actually seen it...
Look around in their 293 videos on that channel and I'm sure you'll see a wide variety of their products; I just pulled out the first one that popped up and looked like a demo. Remember, companies need to advertise to find new customers, so they can't be too secretive.
(My biggest tip to someone looking to interview at a company is to find their youtube channel and watch a few videos of them demo their product before you interview. You will be an expert on their product compared to 95% of the people they interview, which will make you stand out. If they publish white papers, read one or two that look interesting to you. If they publish white papers and none of them look interesting to you, that's a pretty important sign too. Sure, bone up on how to find a cycle in a directed graph, but spend an hour or two on the company themselves and it will be very rewarding.)
The paid versions are called Google, Palantir, the CIA, etc.