Ask HN: Is there an open source equivalent to Palantir?
159 points by ljw1001 on Jan 26, 2017 | 45 comments
It seems to me that journalists and activists could benefit from software that could help track networks, timelines, etc., and organize unstructured data the way an intelligence analyst would. Anything like that out there?

I've been working on a project called Aleph (https://github.com/pudo/aleph, live: http://data.occrp.org/), which is targeted at investigative reporters. It's supposed to handle a related set of problems, which is data integration of diverse public records and journalistic lead generation (i.e. what to investigate next). The road map for the coming two years includes some visual analytics similar to what I guess most people use Palantir for (we'll have to keep it very simple).

The big problem in the NGO space is obviously finding engineers: we've got some budget for this stuff, but getting people to join us, work at a lower salary and then code their hearts out (because teams are tiny) is hard.

On the fun side, you get to see your code applied to fighting real-world problems every day :)

This looks really very interesting! Thank you for the link, I think I might have a few use cases for this if it's easy to work with.

I would love to try out your tool, but I'm getting a 504 Gateway Time-out on http://data.occrp.org/. Is there any chance I can give Aleph a spin?

Yeah, we have scheduled maintenance right now (hello, Docker friends!). Should be back in 10-15 min. There are also some other instances:

http://aleph.openoil.net/ http://opengazettes.org.za/


I feel Borges' Aleph is the more charming and less creepy version of Tolkien's Palantir :) Strive for knowledge, not control.

I worked at palantir. As a designer. Grains of salt.

The value is in the dataset. Most of what's on top is handmade by fresh grads.

Spend your effort getting access to the firehose, and the rest will follow.

O RLY? You think the value of Palantir is in the dataset but NOT in the full-stack, 10-ish-year-old, low-technical-risk analytical platform?

There isn't any 10-ish-year-old, low-technical-risk analytical platform. The wheel is remade for every customer; Palantir is not a software company but a services one.

I think you're asking the wrong question. There is nothing special about Palantir software that you can't find in the open source world already. What Palantir does is just throw a bunch of bodies at the problem to "integrate" things.

If you truly want to provide an open source alternative then curating some jupyter notebooks with the right libraries and integration is all it takes. A properly curated set of R libraries with a nice interface would also do the trick.

I think people greatly underestimate the importance of integration, but I take your point

The only thing close that I can think of is: https://github.com/sirensolutions/kibi

Demo video: https://www.youtube.com/watch?v=g0O8UNM0B7Y


> Logos cannot be given yet and we'll likely never say who they are specifically

Why can't you say who they are?

Because Legal Reasons?

The International Consortium of Investigative Journalists (ICIJ) used Neo4j to untangle the Panama Papers leak.


The problem with Neo4j is that the end results are great, but the ingestion pipeline (especially for unstructured data) is very hard to make general purpose.

The ICIJ used a combination of Apache Tika, Nuix, Tesseract and a bunch of other components when loading data into Neo4j before interrogating it within Linkurious.

It's also worth noting that the Panama data set is riddled with data quality issues (even if this is understandable given the size of the team compared to the scale of the problem).
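To make the data-quality point concrete, here is a minimal sketch of the kind of name normalization an ingestion step might apply before loading entities into a graph database. This is not the ICIJ's pipeline: the suffix list, the function names `normalize_name` and `group_aliases`, and the sample names are all assumptions made up for illustration.

```python
import re
import unicodedata
from collections import defaultdict

# Hypothetical list of legal suffixes to ignore when comparing company names.
LEGAL_SUFFIXES = {"ltd", "limited", "inc", "corp", "sa", "llc"}


def normalize_name(raw: str) -> str:
    """Return a crude comparison key for an entity name."""
    # Strip accents: decompose characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", raw)
    ascii_only = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Lowercase and remove periods so "S.A." collapses to "sa".
    lowered = ascii_only.lower().replace(".", "")
    # Replace anything that is not a letter or digit with a space.
    cleaned = re.sub(r"[^a-z0-9]+", " ", lowered)
    # Drop legal suffixes so "Acme Ltd." and "ACME LIMITED" compare equal.
    tokens = [t for t in cleaned.split() if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)


def group_aliases(names):
    """Group raw names that share the same normalized key."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize_name(name)].append(name)
    return dict(groups)
```

Real data needs far more case analysis than this (transliteration, misspellings, reordered name parts), which is exactly why the general-purpose version is hard.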

My first job out of university was with a Palantir competitor (Detica at the time I was hired, then BAE Systems Detica, then finally BAE Systems Analytics or something like that[1]). There's nothing general purpose about either of their platforms (nor the similar offering from SAS). Companies like that just throw a bunch of fresh graduates at the data, and they hand-write loads of custom ETL code for every data set. A lot of times, even the "analytics" are just shitty little pattern matches over tiny subgraphs formed from the data, the vast majority of which are also coded anew by the "analysts"[2] for each data set. Data quality issues were handled with a massive case analysis during ETL.

There's really no magic going on in such products — the only part that's really general purpose is the GUI used to view the end results. The rest of it is just a bunch of lowly peons doing a ton of gruntwork to hammer the data into a form that said GUI will accept.

[1]: That last name change was after I'd left. I didn't even make it a full year at the company before my conscience got the better of me and I quit.

[2]: An "analytics" job at Detica was really just a half-step above data entry. It was mind-numbing and soul-eating. There was a very high turnover rate because even new graduates were overqualified for the position, and almost everyone was miserable.
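For a sense of what a pattern match over a tiny subgraph can look like, here is a hedged sketch in plain Python. It is not taken from any vendor's product; the triple format, the relation names `officer_of` and `registered_at`, and the toy data are all hypothetical. The pattern flags pairs of people whose companies share a registered address.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: (subject, relation, object) triples.
TRIPLES = [
    ("Alice", "officer_of", "Acme"),
    ("Bob", "officer_of", "Bolt"),
    ("Carol", "officer_of", "Cedar"),
    ("Acme", "registered_at", "1 Main St"),
    ("Bolt", "registered_at", "1 Main St"),
    ("Cedar", "registered_at", "9 Side Rd"),
]


def shared_address_pairs(triples):
    """Find pairs of officers whose companies share a registered address."""
    officers = defaultdict(set)   # company -> officers
    companies = defaultdict(set)  # address -> companies
    for subj, rel, obj in triples:
        if rel == "officer_of":
            officers[obj].add(subj)
        elif rel == "registered_at":
            companies[obj].add(subj)

    matches = set()
    for address, firms in companies.items():
        # Look at every pair of distinct companies at the same address.
        for a, b in combinations(sorted(firms), 2):
            for p in officers[a]:
                for q in officers[b]:
                    if p != q:
                        # Store each pair in a fixed order to avoid duplicates.
                        matches.add((min(p, q), max(p, q), address))
    return sorted(matches)
```

On the toy data, `shared_address_pairs(TRIPLES)` flags Alice and Bob, whose companies are both registered at "1 Main St".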

In the UK we have https://fullfact.org , an independent fact-checking charity. It isn't quite the same thing (it is more of a service than software), but it is open source friendly; for example, it was the topic of a recent Lucene hackathon.

Brilliant landing page at present, I must say!

Most of the core platforms these companies use are built on open source. I think the fundamental reason there isn't something comparable in the free and open arena is that, at the end of the day, to corral the data volumes you will find yourself well on your way down the road to hell.

There is a minimum physical hardware cost in the multi-million dollar range just to get started.

Not to mention the labor cost to operate and architect said hardware. Anything at the petabyte scale needs a minimum skeleton crew of maybe 3-4 people (if they are top talent; more if they aren't) across sysadmin, architect and data engineering roles to keep things production-worthy. The industry is still light on people with those skills, so on top of the bare hardware you are probably talking about 600 to 800K in annual labor cost as a starting point.

All of this is before you get much usage. The minute utilization becomes high, that skeleton crew won't cut it any more: the sysadmin will need to turn into an on-call ops team, and data engineering will need specialists for onboarding vs. analytics vs. the access layer.

If you are doing anything controversial don't forget the importance of a solid security organization which is highly challenging in the distributed computing space.

These are just the technical hurdles. Depending on the data you would be bringing in there are quite possibly legal barriers as well depending on the locality.

So again, it's not so much that this isn't possible; it's more that there hasn't been someone willing to endow the kind of funding that would be required to scale something of this nature solely for the benefit of the public good.

Nah. I'm thinking of the people trying to understand the relationship between (for example) Trump and the Russians. They don't have that much data because they're not drinking the firehose of wiretaps. They're relying on news reports, contacts, financial documents, etc. It's a question of helping people connect the dots.

I would start by looking at Gephi: https://gephi.org and sigma.js for visualization. To run influence or social network analysis, I would look at R's igraph: http://igraph.org/r/ or Spark GraphX: http://spark.apache.org/docs/latest/graphx-programming-guide... . For the graph database: Neo4j, Elasticsearch, or the open source graph database created by the UK intelligence service: https://github.com/gchq/Gaffer . Another interesting project is Mazerunner, which integrates Spark GraphX and Neo4j: https://neo4j.com/developer/apache-spark/#mazerunner
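As a rough illustration of the simplest kind of social network analysis these tools run, here is a sketch of normalized degree centrality in plain Python. It deliberately does not use igraph, GraphX or Gephi, so the function name `degree_centrality` and the contact network below are hypothetical; the real libraries offer this and much richer measures out of the box.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return normalized degree centrality for an undirected edge list."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    # Divide each degree by the maximum possible degree (n - 1).
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


# Hypothetical contact network.
EDGES = [("ann", "bob"), ("ann", "cat"), ("ann", "dan"), ("bob", "cat")]
scores = degree_centrality(EDGES)
top = max(scores, key=scores.get)  # the best-connected node
```

Degree centrality only counts direct contacts; measures such as betweenness or PageRank are usually more informative for influence questions, and are a good reason to reach for a proper graph library.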

You may want to look into Linkurious (https://linkurio.us) which integrates Neo4j, ElasticSearch and a graph visualization interface. It was used extensively in the Panama Papers (https://linkurio.us/panama-papers-how-linkurious-enables-ici...)

Jean from Linkurious

That looks interesting. Out of curiosity, is this open source, and how does it compare to Mazerunner?

We are starting a program at Faraday (http://faraday.io) to offer complimentary access to journalists and educators. Faraday certainly isn't Palantir, but it's similar.

If you're a journalist or educator and want to check it out, shoot me an email, I'm andy@

I dunno, are there non-state actors with the capacity to infiltrate and disrupt militarized adversaries at a global scale in furtherance of the foreign policy motives of a superpower on the world stage?

I wonder if they'd release under the GPLv3 or BSD licence. Maybe they'll let me fork them on GitHub.

I'd like to find some software to help me manipulate the world order without being detected. Any cool projects I should look into?

Not exactly what you're asking for. But http://www.jplusplus.org/en/ have mailing lists and is a community of journalists looking through data. I'm sure they have alternatives or suggestions.

I'm thinking about software that does link analysis - this person works with that person, got paid by other person, etc. Basically, computer support for intelligence analysts, but in the public service.
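A minimal sketch of that kind of link analysis, assuming relationships are stored as labelled links and the question is "how is A connected to C?". The names, relation labels and the `find_chain` helper are hypothetical; it uses a plain breadth-first search, so it returns a shortest chain by number of hops.

```python
from collections import defaultdict, deque

# Hypothetical relationships: (source, relation, target).
LINKS = [
    ("Person A", "works_with", "Person B"),
    ("Person B", "paid_by", "Company X"),
    ("Company X", "owned_by", "Person C"),
    ("Person A", "related_to", "Person D"),
]


def find_chain(links, start, goal):
    """Return the shortest chain of labelled links from start to goal, or None."""
    graph = defaultdict(list)
    for src, rel, dst in links:
        # Treat every link as traversable in both directions.
        graph[src].append((rel, dst))
        graph[dst].append((f"inverse of {rel}", src))

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, rel, nxt)]))
    return None
```

For example, `find_chain(LINKS, "Person A", "Person C")` walks through Person B and Company X. A graph database or a tool like Maltego does essentially this at scale, with a UI on top.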

Are you certain that you want a competitor to Palantir? It sounds like you might want i2 Analyst's Notebook from IBM, or something like Maltego from Paterva. Maltego is also free (ish), which is cool.

This is what I'm looking for. Thank you.

Because that capability is too valuable. If you're interested in writing code that can do that, you can find someone to make your life comfortable on the condition that they get to keep your code. If a repo of code that can do that starts growing, someone will come offering its authors candies :-).

That may change as it gets a bit more commonplace, or if tools to do some of the heavy lifting get out.

Keep an eye on programs like the one at CU Boulder; it should produce some interesting research that moves this along outside of the halls of industry.

I recently released a project of mine that may be of interest: "Gransk - Document processing for investigations" (https://gransk.com and https://github.com/pcbje/gransk)

Given their pricing (and the pricing of nearest competitors), combined with the relatively solved technical problem areas they operate in (connect to a data source, do some human driven NLP, draw a graph, show a map, map everything to some ontology), what they do is very ripe for some serious open source disruption.

The easy part is really the interface and information displays; the harder part (and where they make the lion's share of their money) is in data connection services and software customization.

Building an open source Palantir tool wouldn't be all that difficult; in fact, a great many organizations just build some subset of that tool using readily available open source components, with tighter coupling to their business needs. But these efforts are fractured and disorganized, and there isn't a great centralized open source tool that really replicates their system.

Should there be? I think the general problem of pulling together lots of information into a common pool, then being able to annotate that data and map it to a semantic model, is useful, and it generalizes well. But at the same time, many, many sources of data are already available in nice, semantically organized ways with simpler interfaces (think IMDB, Pouet, Wikipedia, etc.), so it's not quite clear that their approach offers enough payoff over these easier methods.
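To illustrate the "map everything to some ontology" step mentioned above, here is a small sketch that translates records from differently shaped sources into one shared model. The source names, field names, entity kinds and the `map_record` helper are all assumptions for the example, not anyone's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A record in the shared model; the fields are an assumed toy ontology."""
    kind: str
    name: str
    source: str


# Hypothetical per-source mappings: source field -> entity kind.
FIELD_MAPPINGS = {
    "company_registry": {"company_name": "Company", "director": "Person"},
    "land_records": {"owner": "Person", "parcel_id": "Asset"},
}


def map_record(source, record):
    """Translate one raw record into entities of the shared model."""
    mapping = FIELD_MAPPINGS.get(source, {})
    entities = []
    for field, value in record.items():
        kind = mapping.get(field)
        # Skip fields the mapping does not know about, and empty values.
        if kind and value:
            entities.append(Entity(kind=kind, name=str(value).strip(), source=source))
    return entities
```

The per-source mapping table is where the custom work lives: every new data source needs its own entry, which matches the point that data connection is the hard and expensive part.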

GDELT project provides open data on everything that happens on the planet.

In addition to those on the wikipedia link from Jonnybgood, see also

lumify (lumify.io) developed by Altamira

Visual Understanding Environment (http://vue.tufts.edu/) from Tufts

Lumify is for networked data analysis and data annotation

VUE is more for presentation, but has very basic network and semantic analysis tools and ontology mapping capabilities.

and last but not least - TimeFlow is old but useful for mapping temporal sequences quickly and easily. See https://drawingbynumbers.org/tools/timeflow.html
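For anyone who just wants the core idea of mapping temporal sequences without a GUI, here is a rough sketch that orders events and groups them by day. It has nothing to do with TimeFlow's own code; the event list and the `build_timeline` helper are hypothetical.

```python
from datetime import date
from itertools import groupby

# Hypothetical events: (ISO date string, description).
EVENTS = [
    ("2016-04-03", "Leak published"),
    ("2015-11-20", "Shell company registered"),
    ("2016-04-03", "Minister denies involvement"),
    ("2016-01-15", "Payment recorded"),
]


def build_timeline(events):
    """Return [(date, [descriptions])] in chronological order."""
    # sorted() is stable, so same-day events keep their input order.
    parsed = sorted(
        ((date.fromisoformat(day), text) for day, text in events),
        key=lambda item: item[0],
    )
    return [
        (day, [text for _, text in group])
        for day, group in groupby(parsed, key=lambda item: item[0])
    ]
```

Calling `build_timeline(EVENTS)` yields one entry per day, oldest first, with both 2016-04-03 events grouped together.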

RTÉ Investigations Unit used OrientDB to find corruption cases: http://orientdb.com/rte-iu_case-study/

There are companies out there (Sumo Logic is one) that provide free access depending on unstructured data volume. But if you have huge volumes of data, in the terabytes, and you want to manage it using open source, you might end up spending so much on scaling and supporting the system that you should buy a paid version instead.

"[..] from software that could help track networks, timelines, etc., and organize unstructured data the way an intelligence analyst would."

Do you know how it looks in Palantir's case? I've never actually seen it ..

no. they keep it pretty hidden, but there are other tools designed for intelligence analysis that are easier to see.

Well, for certain values of "hidden" that include 40-minute-long demos of the product on YouTube, e.g. https://www.youtube.com/watch?v=f86VKjFSMJE

Look around in their 293 videos on that channel and I'm sure you'll see a wide variety of their products, I just pulled out the first one that popped up and looked like a demo. Remember, companies need to advertise to find new customers, so they can't be too secretive.

(My biggest tip to someone looking to interview at a company is to find their youtube channel and watch a few videos of them demo their product before you interview. You will be an expert on their product compared to 95% of the people they interview, which will make you stand out. If they publish white papers, read one or two that look interesting to you. If they publish white papers and none of them look interesting to you, that's a pretty important sign too. Sure, bone up on how to find a cycle in a directed graph, but spend an hour or two on the company themselves and it will be very rewarding.)

You might try Lumify from Altamira -- http://lumify.io/.

The newer version is called Visallo: https://github.com/v5analytics/visallo (same people, different company/platform).

It's called Wikipedia [ https://youtu.be/ESVQknHESuA?t=2209 ]

The paid versions are called Google, Palantir, the CIA etc.
