
Fishing for graphs in a Hadoop data lake - bjerun
https://www.oreilly.com/ideas/fishing-for-graphs-in-a-hadoop-data-lake
======
matthewvon3
The article is a nice summary. I think the author missed a key argument for
his multi-model case: AWS costs. Short queries via Spark/Hadoop will cost more
on AWS than a focused graph model on a dedicated graph DB / multi-model DB on
AWS.

------
janemanos
Thanks for the article. Seems like a good approach to combine the strengths of
both graph & Hadoop. I wonder which other use cases, in addition to the
described ones, could be suitable here.

Anyone have an idea?

~~~
lmeyerov
We get good visibility into what folks do in practice based on their use of
Graphistry: we're a DB-agnostic scalable visual graph analytics environment,
so we've been seeing (& assisting) what analysts do standalone / what
developers build / what data scientists do from notebooks.

1. Most graphs are small (< 100M nodes & edges, probably even < 1M). So
analysts just load a CSV directly into us or dump into pandas and work from
there. Most Graphistry users do this. It became so common that we baked in a
transform to our library that shortcuts the data wrangling problem of SQL/CSV
records -> node table + edge table via our "hypergraph" transform.

2. Sometimes the data is too big or they want to use a query language they're
more comfortable with vs. Pandas. We'll see a bunch of SQL (incl. Spark),
Splunk, Elastic, etc. when approach #1 isn't enough. No need for
Neo4j/Titan/GraphX for that problem. If they end up doing this a lot, Neo4j
ends up being a sensible choice because of the ergonomics of the Cypher query
language.

3. Sometimes, graph queries or analytics _are_ technically critical. We'll
see mostly analytics via use of NetworkX or maybe iGraph, such as for slightly
better community detection, or something smarter than degrees for node sizes.
Sometimes we'll see query langs, probably Neo4j because (I'm guessing) the
database is packaged accessibly. For ergonomic reasons, I've been expecting
the efforts around OpenCypher for Spark will eventually supplant GraphX for
the exploratory case, and we'll start seeing more Janus as it gains more
steam.
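To make item 3 concrete, here's a small NetworkX sketch of that analytics tier, using NetworkX's built-in karate-club sample graph as a stand-in for an analyst's edge list: modularity-based community detection plus plain degree as a node-size heuristic.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Built-in sample graph standing in for a real edge list
G = nx.karate_club_graph()

# "Slightly better community detection" than eyeballing clusters
communities = list(greedy_modularity_communities(G))

# Degree as a simple node-size heuristic for visualization
sizes = {n: 3 + 2 * d for n, d in G.degree()}
```

Swapping `greedy_modularity_communities` for a Louvain or label-propagation variant is the usual next step when modularity maximization isn't a good fit.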

4. Even more occasionally, people are building true graph algorithms that
cannot be sufficiently approximated with their existing tools. E.g., we're
seeing a bunch more in the knowledge graph space (ex: finance), and in
security/fraud, we're seeing the bigger enterprises needing the same for
correlation work. This gets into powering latency-sensitive ML / detection
algorithms, fast analyst experiences, etc. However, stuff like regular SQL &
Splunk & Spark still gets _most_ teams mostly there with great scaleout etc.,
so there's a bit of a problem/time/budget/expertise thing going on.

We've been happy to support all these kinds of projects at Graphistry -- and
are often part of the entry into them -- so always happy to chat about it.
Likewise, I'm not listing work by good teams like those at Datastax Graph,
Blazegraph, and Amazon Neptune -- we see them, just they're used more in
specific enterprise/federal scenarios.

~~~
neunhoef
Author of the posted article here: thanks for the additional pointers. It
seems that Graphistry excels at visualization. Essentially, your offering
confirms the main story of the article: make more out of your (graph) data by
extracting it from Hadoop into a different tool.

And obviously, one should use the right tool for the purpose. I think
Graphistry is a good choice for graph visualization, graph databases like
ArangoDB or Neo4j will be good at ad hoc traversals, and multi-model databases
like ArangoDB or OrientDB will be good at a wide range of ad hoc queries.
Anyway, thanks again for the pointers.

~~~
lmeyerov
Yep. Maybe the observation is (1) data has gravity -- it was originally in
another non-graph-specific DB -- and (2) the graph structure part is normally
small. So we indeed see a lot of extraction into easier-to-use systems.

The nuance being... with stuff like data science notebooks and pandas, the
people skilled enough to do extraction are also skilled enough that it's
easier to just use pandas. The exception is repeat work or when it is for
regular analysts. Friendly query languages like Neo4j's Cypher help there.
Not sure what Arango supports... Gremlin? Proprietary?

Graphistry's environment is agnostic, and _not_ a database, so it'd be wrong
of me to advocate teams drop their system of record and use just us ;-) We
ended up building a visual "playbook" investigation environment to help teams
streamline these scenarios. They run visual playbooks against their legacy db
(splunk, elastic, sql, ...) for faux-graph queries, or their new graph db for
deeper ones (e.g., path queries). So we're more of the system of record +
superpowers for your investigations, kind of like a smarter version of what
Tableau/Looker do for SQL.
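For a concrete sense of what a "faux-graph query" against a tabular legacy store looks like, here's a hedged pandas sketch (the edge list and node names are invented): a two-hop neighborhood expressed as a self-join, which is exactly the kind of query a path-capable graph DB states more naturally.

```python
import pandas as pd

# Hypothetical edge list living in a tabular store (SQL, Splunk, etc.)
edges = pd.DataFrame({"src": ["a", "b", "b", "c"],
                      "dst": ["b", "c", "d", "d"]})

# Two-hop neighbors of "a": one self-join per hop
hop1 = edges[edges["src"] == "a"]
two_hop = hop1.merge(edges, left_on="dst", right_on="src",
                     suffixes=("_1", "_2"))
targets = sorted(two_hop["dst_2"].unique())
# targets == ["c", "d"]
```

Each extra hop means another join, which is why variable-length path queries are where graph databases start to pay off.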

