
Ask HN: Is there an open source equivalent to Palantir? - ljw1001
It seems to me that journalists and activists could benefit from software that could help track networks, timelines, etc., and organize unstructured data the way an intelligence analyst would. Anything like that out there?
======
pudo
I've been working on a project called Aleph
([https://github.com/pudo/aleph](https://github.com/pudo/aleph), live:
[http://data.occrp.org/](http://data.occrp.org/)), which is targeted at
investigative reporters. It's supposed to handle a related set of problems,
which is data integration of diverse public records and journalistic lead
generation (i.e. what to investigate next). The road map for the coming two
years includes some visual analytics similar to what I guess most people use
Palantir for (we'll have to keep it very simple).

The big problem in the NGO space is obviously finding engineers: we've got
some budget for this stuff, but getting people to join us, work at a lower
salary, and then code their hearts out (because teams are tiny) is hard.

On the fun side, you get to see your code applied to fighting real-world
problems every day :)

~~~
justusw
I would love to try out your tool, but I'm getting a 504 Gateway Time-out on
[http://data.occrp.org/](http://data.occrp.org/). Is there any chance I can
give Aleph a spin?

~~~
pudo
Yeah, we have scheduled maintenance right now (hello, Docker friends!).
Should be back in 10-15 min. There are also some other instances:

[http://aleph.openoil.net/](http://aleph.openoil.net/)
[http://opengazettes.org.za/](http://opengazettes.org.za/)

------
RickS
I worked at Palantir. As a designer. Grains of salt.

The value is in the dataset. Most of what's on top is handmade by fresh grads.

Spend your effort getting access to the firehose, and the rest will follow.

~~~
fvey
Orly. So you think the value of Palantir is in the dataset, but NOT in the
full-stack, ten-ish-year-old, low-technical-risk analytical platform?

~~~
ousta
There isn't any ten-ish-year-old, low-technical-risk analytical platform. The
wheel is remade for every customer; Palantir is not a software company but a
services one.

------
dkarapetyan
I think you're asking the wrong question. There is nothing special about
Palantir software that you can't find in the open source world already. What
Palantir does is just throw a bunch of bodies at the problem to "integrate"
things.

If you truly want to provide an open source alternative, then curating some
Jupyter notebooks with the right libraries and integrations is all it takes. A
properly curated set of R libraries with a nice interface would also do the
trick.
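As a sketch of what such a curated notebook might contain (a toy example with invented names and records, not anything Palantir actually ships), even the standard library gets you surprisingly far with basic link analysis:

```python
from collections import defaultdict

# Toy records of the kind an analyst might extract from documents.
# All names and relationships here are invented for illustration.
records = [
    ("Alice", "works_with", "Bob"),
    ("Bob", "paid_by", "Acme Corp"),
    ("Carol", "paid_by", "Acme Corp"),
    ("Alice", "met_with", "Carol"),
    ("Dave", "works_with", "Alice"),
]

# Build an undirected adjacency map: entity -> set of connected entities.
graph = defaultdict(set)
for src, _rel, dst in records:
    graph[src].add(dst)
    graph[dst].add(src)

# Degree centrality: the entities with the most connections surface first.
ranked = sorted(graph, key=lambda e: len(graph[e]), reverse=True)
for entity in ranked:
    print(entity, len(graph[entity]))
```

Anything fancier (betweenness, community detection) is one `import networkx` away, which is exactly the kind of thing a curated notebook would pre-wire.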

~~~
ljw1001
I think people greatly underestimate the importance of integration, but I take
your point

------
dwynings
The only thing close that I can think of is:
[https://github.com/sirensolutions/kibi](https://github.com/sirensolutions/kibi)

Demo video:
[https://www.youtube.com/watch?v=g0O8UNM0B7Y](https://www.youtube.com/watch?v=g0O8UNM0B7Y)

------
maxdemarzi
The International Consortium of Investigative Journalists used Neo4j to
untangle the Panama Papers leak.

[https://neo4j.com/blog/icij-neo4j-unravel-panama-papers/](https://neo4j.com/blog/icij-neo4j-unravel-panama-papers/)

~~~
joelschw
The problem with Neo4j is that the end results are great, but the ingestion
pipeline (especially for unstructured data) is very hard to make general
purpose.

The ICIJ used a combination of Apache Tika, Nuix, Tesseract and a bunch of
other components when loading data into Neo4j before interrogating it within
Linkurious.

It's also worth noting that the Panama data set is riddled with data quality
issues (understandable, given the size of the team compared to the scale of
the problem).
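To illustrate the kind of cleanup that ingestion work involves (a hypothetical normalization step, not ICIJ's actual pipeline code), here is one way near-duplicate company names can be collapsed before they become distinct graph nodes:

```python
import re
import unicodedata

def normalize_name(raw: str) -> str:
    """Collapse common variants of a company name into one canonical key."""
    # Strip accents (e.g. "Société" -> "Societe").
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(c for c in text if not unicodedata.combining(c))
    # Lowercase, turn punctuation into spaces, squeeze whitespace.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    text = re.sub(r"\s+", " ", text).strip()
    # Drop legal-form suffixes that vary between filings.
    text = re.sub(r"\b(ltd|limited|inc|sa|llc|corp|co)\b", "", text)
    return re.sub(r"\s+", " ", text).strip()

# Three spellings from different source documents map to one node key.
variants = ["Mossack Fonseca & Co.", "MOSSACK FONSECA", "Mossack  Fonseca, Ltd."]
keys = {normalize_name(v) for v in variants}
print(keys)
```

Real pipelines add fuzzy matching and manual review on top, but even this crude pass prevents a lot of duplicate nodes.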

~~~
throwaway2601
My first job out of university was with a Palantir competitor (Detica at the
time I was hired, then BAE Systems Detica, then finally BAE Systems Analytics
or something like that[1]). There's nothing general purpose about either of
their platforms (nor the similar offering from SAS). Companies like that just
throw a bunch of fresh graduates at the data, and they hand-write loads of
custom ETL code for every data set. A lot of times, even the "analytics" are
just shitty little pattern matches over tiny subgraphs formed from the data,
the vast majority of which are also coded anew by the "analysts"[2] for each
data set. Data quality issues were handled with a massive case analysis during
ETL.

There's really no magic going on in such products — the only part that's
really general purpose is the GUI used to view the end results. The rest of it
is just a bunch of lowly peons doing a _ton_ of gruntwork to hammer the data
into a form that said GUI will accept.

[1]: That last name change was after I'd left. I didn't even make it a full
year at the company before my conscience got the better of me and I quit.

[2]: An "analytics" job at Detica was really just a half-step above data
entry. It was mind-numbing and soul-eating. There was a very high turnover
rate because even new graduates were overqualified for the position, and
almost everyone was miserable.

------
m-i-l
In the UK we have [https://fullfact.org](https://fullfact.org), an
independent fact-checking charity. It isn't quite the same thing (it's more of
a service than software), but it is open-source friendly: it was the topic of
a recent Lucene hackathon, for example.

~~~
mrmondo
Brilliant landing page at present, I must say!

------
batoure
Most of the core platforms these companies use are built on open source. I
think the fundamental reason there isn't something comparable in the free and
open arena is that, at the end of the day, corralling the data volumes puts
you well on your way down the road to hell.

There is a minimum physical hardware cost in the multi-million dollar range
just to get started.

Not to mention the labor cost to operate and architect said hardware. Anything
at the petabyte scale needs a minimum skeleton crew of maybe 3-4 people (if
they are top talent; more if they aren't) across sysadmin, architect, and data
engineering roles to keep things production worthy. The industry is still
light on people with those skills, so on top of the bare hardware you are
probably talking about $600-800K in annual labor cost as a starting point.

And this would all be before you get much usage; the minute utilization
becomes high, that skeleton crew won't cut it any more. The sysadmin will need
to turn into an on-call ops team, and data engineering will need specialists
for on-boarding vs. analytics vs. the access layer.

If you are doing anything controversial don't forget the importance of a solid
security organization which is highly challenging in the distributed computing
space.

These are just the technical hurdles. Depending on the data you would be
bringing in there are quite possibly legal barriers as well depending on the
locality.

So again, it's not so much that this isn't possible, more that there hasn't
been someone willing to endow the kind of funding that would be required to
scale something of this nature solely for the benefit of the public good.

~~~
ljw1001
Nah. I'm thinking of the people trying to understand the relationship between
(for example) Trump and the Russians. They don't have that much data because
they're not drinking the firehose of wiretaps. They're relying on news
reports, contacts, financial documents, etc. It's a question of helping people
connect the dots.

------
UCAN2
I would start by looking at Gephi ([https://gephi.org](https://gephi.org)) and
sigma.js for visualization. To run influence or social network analysis, I
would look at R's igraph ([http://igraph.org/r/](http://igraph.org/r/)) or
Spark GraphX ([http://spark.apache.org/docs/latest/graphx-programming-guide.html#the-property-graph](http://spark.apache.org/docs/latest/graphx-programming-guide.html#the-property-graph)).
For the graph database: Neo4j, Elasticsearch, or Gaffer, the open source graph
database created by the UK intelligence service
([https://github.com/gchq/Gaffer](https://github.com/gchq/Gaffer)). Another
interesting project is Mazerunner, which integrates Spark GraphX and Neo4j:
[https://neo4j.com/developer/apache-spark/#mazerunner](https://neo4j.com/developer/apache-spark/#mazerunner)
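For a concrete feel of the influence analysis these libraries provide, here is a minimal power-iteration PageRank over a toy edge list in plain Python; igraph and GraphX implement the same computation, just at scale (the graph and names are invented):

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a directed edge list."""
    nodes = {n for e in edges for n in e}
    out = {n: [d for s, d in edges if s == n] for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            # Dangling nodes spread their rank evenly over the whole graph.
            targets = out[n] or list(nodes)
            share = damping * rank[n] / len(targets)
            for t in targets:
                new[t] += share
        rank = new
    return rank

# Toy graph: A, B, and D all point at C; C points back at A.
edges = [("A", "C"), ("B", "C"), ("D", "C"), ("C", "A")]
rank = pagerank(edges)
top = max(rank, key=rank.get)
print(top, round(rank[top], 3))  # C comes out on top
```

The interesting part for analysts is that A also ranks high purely by being pointed at by C: influence flows through links, which is what makes this family of measures useful for "who matters in this network" questions.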

~~~
jvilledieu
You may want to look into Linkurious
([https://linkurio.us](https://linkurio.us)), which integrates Neo4j,
Elasticsearch, and a graph visualization interface. It was used extensively in
the Panama Papers investigation
([https://linkurio.us/panama-papers-how-linkurious-enables-icij-to-investigate-the-massive-mossack-fonseca-leaks/](https://linkurio.us/panama-papers-how-linkurious-enables-icij-to-investigate-the-massive-mossack-fonseca-leaks/)).

Jean from Linkurious

~~~
UCAN2
That looks interesting. Out of curiosity, is this open source, and how does it
compare to Mazerunner?

------
briteside
We are starting a program at Faraday ([http://faraday.io](http://faraday.io))
to offer complimentary access to journalists and educators. Faraday certainly
isn't Palantir, but it's similar.

If you're a journalist or educator and want to check it out, shoot me an
email, I'm andy@

------
urungusOdor
I dunno, are there non-state actors with the capacity to infiltrate and
disrupt militarized adversaries at a global scale in furtherance of the
foreign policy motives of a superpower on the world stage?

I wonder if they'd release under the GPLv3 or BSD license. Maybe they'll let
me fork them on GitHub.

~~~
helpfulanon
I'd like to find some software to help me manipulate the world order without
being detected. Any cool projects I should look into?

------
chhib
Not exactly what you're asking for. But
[http://www.jplusplus.org/en/](http://www.jplusplus.org/en/) have mailing
lists and is a community of journalists looking through data. I'm sure they
have alternatives or suggestions.

~~~
ljw1001
I'm thinking about software that does link analysis - this person works with
that person, who got paid by some other person, etc. Basically, computer
support for intelligence analysts, but in the public service.
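That kind of dot-connecting is, at its core, a shortest-path query over a link graph. A minimal sketch, with invented entities and relationships:

```python
from collections import deque

# Hypothetical links of the "works with / got paid by" kind described above.
links = {
    ("alice", "bob"): "works with",
    ("bob", "shell co"): "got paid by",
    ("shell co", "dmitri"): "is owned by",
}

# Build an undirected adjacency map from the link records.
graph = {}
for (a, b), rel in links.items():
    graph.setdefault(a, []).append((b, rel))
    graph.setdefault(b, []).append((a, rel))

def connect(start, goal):
    """Breadth-first search: the shortest chain of links between two entities."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor, _rel in graph.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no chain of links connects the two

print(connect("alice", "dmitri"))  # ['alice', 'bob', 'shell co', 'dmitri']
```

Graph databases like Neo4j and tools like Maltego wrap exactly this kind of query in an interactive interface; the hard part, as others note, is getting the links into the graph in the first place.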

------
Godel_unicode
Are you certain that you want a competitor to Palantir? It sounds like you
might want i2 Analyst's Notebook from IBM, or something like Maltego from
Paterva. Maltego is also free(ish), which is cool.

~~~
ljw1001
This is what I'm looking for. Thank you.

------
ChuckMcM
Because that capability is too valuable. If you're interested in writing code
that can do that, you can find someone to make your life comfortable on the
condition that they get to keep your code. If a repo of code that can do that
starts growing, someone will come along offering candies :-).

That may change as this gets a bit more commonplace, or if tools that do some
of the heavy lifting get out.

Keep an eye on some programs like the one at CU Boulder; it should produce
some interesting research that moves this along outside the halls of
industry.

------
pcbje
I recently released a project of mine that may be of interest: "Gransk -
Document processing for investigations"
([https://gransk.com](https://gransk.com) and
[https://github.com/pcbje/gransk](https://github.com/pcbje/gransk))

------
bane
Given their pricing (and the pricing of nearest competitors), combined with
the relatively solved technical problem areas they operate in (connect to a
data source, do some human driven NLP, draw a graph, show a map, map
everything to some ontology), what they do is very ripe for some serious open
source disruption.

The easy part is really the interface and information displays; the harder
part (and where they make the lion's share of their money) is in data
connection services and software customization.

Building an open source Palantir tool wouldn't be all that difficult; in fact,
a great many organizations build some subset of those tools from readily
available open source components, with tighter coupling to their business
needs. But these efforts are fractured and disorganized, and there isn't a
great centralized open source tool that really replicates their system.

Should there be? I think the general problem of pulling lots of information
into a common pool, then being able to annotate that data and map it to a
semantic model, is useful, and it generalizes well. But at the same time,
many, many sources of data are already available in nicely semantically
organized forms with simpler interfaces (think IMDb, Pouet, Wikipedia, etc.),
so it's not quite clear that their approach offers enough payoff over these
easier methods.
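As a sketch of what "map it to a semantic model" can mean in practice (the schema and field names here are invented), differently shaped source records get normalized onto one small shared entity model:

```python
from dataclasses import dataclass

@dataclass
class Entity:
    """A minimal shared semantic model: everything becomes a typed entity."""
    kind: str        # e.g. "person", "company"
    name: str
    source: str      # provenance: where the record came from

# Two differently shaped source records from the common pool.
news_record = {"person_name": "J. Smith", "outlet": "Example Times"}
registry_record = {"company": "Acme Ltd", "registry": "Companies House"}

# Per-source mappers pull each shape onto the shared model.
def map_news(rec):
    return Entity(kind="person", name=rec["person_name"], source=rec["outlet"])

def map_registry(rec):
    return Entity(kind="company", name=rec["company"], source=rec["registry"])

pool = [map_news(news_record), map_registry(registry_record)]
companies = [e.name for e in pool if e.kind == "company"]
print(companies)  # ['Acme Ltd']
```

The per-source mappers are exactly the hand-written ETL glue several commenters describe: the shared model is simple, but someone has to write one mapper per data source.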

------
sAbakumoff
The GDELT Project provides open data on events happening around the planet.

------
jonnybgood
You can start here:
[https://en.wikipedia.org/wiki/Social_network_analysis_software](https://en.wikipedia.org/wiki/Social_network_analysis_software)

~~~
ptrott2017
In addition to those on the Wikipedia link from jonnybgood, see also:

Lumify (lumify.io), developed by Altamira

Visual Understanding Environment
([http://vue.tufts.edu/](http://vue.tufts.edu/)), from Tufts

Lumify is for networked data analysis and data annotation.

VUE is more for presentation, but has very basic network and semantic analysis
tools and ontology mapping capabilities.

And last but not least: TimeFlow is old, but useful for mapping temporal
sequences quickly and easily. See
[https://drawingbynumbers.org/tools/timeflow.html](https://drawingbynumbers.org/tools/timeflow.html)

------
lvca
RTÉ Investigations Unit used OrientDB to find corruption cases:
[http://orientdb.com/rte-iu_case-study/](http://orientdb.com/rte-iu_case-study/)

------
smanikim
There are companies out there (Sumo Logic is one) which provide free access
depending on unstructured data volume. But if you have huge volumes of data in
the TBs and you want to manage them using open source, you might end up
spending so much money scaling and supporting the system that you should have
bought the paid versions instead.

------
suls
"[..] from software that could help track networks, timelines, etc., and
organize unstructured data the way an intelligence analyst would."

Do you know what it looks like in Palantir's case? I've never actually seen it...

~~~
ljw1001
No. They keep it pretty hidden, but there are other tools designed for
intelligence analysis that are easier to see.

~~~
mandevil
Well, for certain values of "hidden" that include 40-minute-long product demos
on YouTube, e.g.
[https://www.youtube.com/watch?v=f86VKjFSMJE](https://www.youtube.com/watch?v=f86VKjFSMJE)

Look around in their 293 videos on that channel and I'm sure you'll see a wide
variety of their products, I just pulled out the first one that popped up and
looked like a demo. Remember, companies need to advertise to find new
customers, so they can't be too secretive.

(My biggest tip to someone looking to interview at a company is to find their
YouTube channel and watch a few videos of them demoing their product before
you interview. You will be an expert on their product compared to 95% of the
people they interview, which will make you stand out. If they publish white
papers, read one or two that look interesting to you. If they publish white
papers and none of them look interesting to you, that's a pretty important
sign too. Sure, bone up on how to find a cycle in a directed graph, but spend
an hour or two on the company itself and it will be very rewarding.)

------
emdeprit
You might try Lumify from Altamira -- [http://lumify.io/](http://lumify.io/).

~~~
ljw1001
The newer version is called Visallo:
[https://github.com/v5analytics/visallo](https://github.com/v5analytics/visallo).
Same people, different company/platform.

------
jusq2
It's called Wikipedia [
[https://youtu.be/ESVQknHESuA?t=2209](https://youtu.be/ESVQknHESuA?t=2209) ]

The paid versions are called Google, Palantir, the CIA etc.

