
Feeding Graph databases – a third use-case for modern log management platforms - lennartkoopmann
https://medium.com/@henrikjohansen/feeding-graph-databases-a-third-use-case-for-modern-log-management-platforms-d5dac8a80d53
======
lmeyerov
Awesome article! This is the exact kind of use case we've been helping
enterprises with at Graphistry, especially for SIEMs and operations data.
Worth adding two aspects we've found important in our journey here:

* We found the need to play nice with Neo4j as well as other more common systems like Kafka/HDFS/Spark, Titan, and Splunk

* It helps to be able to work with big event graphs, where we'll often want to do something like filter for the day's 1M+ priority-10 alerts and see how they connect. As a result, we spend a lot of time on our GPU frontend+backend so you can spot patterns across all of the day's big events, and on exploratory tooling so you can drill down rather than write queries. A rough sketch of the filter-then-connect step is below.
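
To make that concrete, here's a rough pandas sketch of that step -- the file name and columns (priority, src_ip, dst_ip) are hypothetical, not our actual schema:

    import pandas as pd

    # Hypothetical alert export; the columns are illustrative only.
    alerts = pd.read_csv("alerts.csv", parse_dates=["timestamp"])

    # Keep the last day's priority-10 alerts (often 1M+ rows).
    recent = alerts["timestamp"] >= alerts["timestamp"].max() - pd.Timedelta(days=1)
    hot = alerts[recent & (alerts["priority"] >= 10)]

    # Collapse the alerts into a weighted edge list to see how they connect.
    edges = (hot.groupby(["src_ip", "dst_ip"])
                .size()
                .reset_index(name="alert_count"))
    print(edges.sort_values("alert_count", ascending=False).head())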

If relevant, happy to share an API key (info@graphistry.com) or get on Skype!

------
david_p
Since this seems to be trending: any questions about Linkurious or graph
databases are welcome. Linkurious CTO here :)

~~~
rspeer
I abandoned Neo4j -- and graph databases in general -- years ago because there
was no reasonable way to load in several million edges that were not already
in a graph database.

Does Neo4j have a better importing story now? I see a blog post from 2014 [1]
that makes importing merely a million edges between 500 nodes sound like it's
still a terribly difficult operation, giving me the impression that graph
databases aren't quite ready for <s>big</s> medium data yet.

[1] http://jexp.de/blog/2014/06/load-csv-into-neo4j-quickly-and-successfully/

If, for example, I wanted to load an N-Triples file that's approximately the
size and shape of DBPedia, can I reasonably do so? What tools should I use to
get the job done quickly without descending into a Java nightmare?

~~~
lmeyerov
We settled on "medium"-sized batching into Titan, and since we share your
experience, I'm hoping the DataStax acquisition means ingest will improve.

For the bulk of our work, we do what you'd expect -- load terabytes into HDFS,
use (Py)Spark for straight SQL and some join helper functions, and
occasionally pull in the GraphX Scala libraries.
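
For flavor, a minimal (Py)Spark sketch of that pattern -- the HDFS paths and field names are made up:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("event-graph").getOrCreate()

    # Load raw events from HDFS (path and schema are hypothetical).
    events = spark.read.parquet("hdfs:///logs/events/")
    events.createOrReplaceTempView("events")

    # Straight SQL does most of the work: derive a who-talks-to-whom edge list.
    edges = spark.sql("""
        SELECT src_host, dst_host, COUNT(*) AS n
        FROM events
        GROUP BY src_host, dst_host
    """)

    edges.write.mode("overwrite").parquet("hdfs:///graphs/edges/")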

I'm curious -- what did you end up doing, & for what?

~~~
rspeer
I maintain ConceptNet [1]. It's a large-ish, messy semantic graph linking
together crowd-sourced knowledge (from Open Mind Common Sense, Wiktionary,
some Games with a Purpose, etc.) with expert-created resources (WordNet,
OpenCyc, etc.).

[1] [http://conceptnet5.media.mit.edu](http://conceptnet5.media.mit.edu)

It's turned out to be a good input for machine learning about semantics, which
has changed the goals of its representation a bit -- not only do I need to be
able to load in data easily, I also need to be able to iterate over all of it.
But some graph operations would be nice to have, too.

Many technical people I describe the project to immediately ask me what graph
database I'm using, both before and after the ill-fated semester of grad
school where I actually tried to use graph databases.

The answer to what I use now is: a bit of SQLite and some flat files. No need
for HDFS; it all still fits easily on a hard disk.
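
The storage layer really is about that simple. A sketch (not ConceptNet's actual schema or file layout):

    import sqlite3

    conn = sqlite3.connect("edges.db")
    conn.execute("CREATE TABLE IF NOT EXISTS edges "
                 "(subj TEXT, rel TEXT, obj TEXT, weight REAL)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_subj ON edges (subj)")

    # Bulk load from a flat file: one tab-separated edge per line.
    with open("edges.tsv") as f:
        rows = (line.rstrip("\n").split("\t") for line in f)
        conn.executemany("INSERT INTO edges VALUES (?, ?, ?, ?)", rows)
    conn.commit()

    # Iterating over everything -- the part graph databases made hard --
    # is just a table scan.
    for subj, rel, obj, weight in conn.execute("SELECT * FROM edges"):
        pass  # feed each edge into the ML pipeline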

------
lobster_johnson
Tracking flow information about networks and apps on Linux is something I have
been thinking about. I'm wondering how other people are doing it.

For networking in general, I suppose you could sample /proc/PID/fd and
/proc/PID/net/tcp at regular intervals, though it would miss connections that
open and close between samples.
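
A minimal polling sketch of that idea (IPv4 and established connections only; note that /proc/PID/net/tcp reports per network namespace rather than per process):

    import glob
    import socket
    import struct

    def parse_addr(hexaddr):
        # /proc/net/tcp stores the IPv4 address as little-endian hex
        # and the port as hex.
        ip_hex, port_hex = hexaddr.split(":")
        ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
        return ip, int(port_hex, 16)

    def sample_connections():
        conns = set()
        for path in glob.glob("/proc/[0-9]*/net/tcp"):
            try:
                with open(path) as f:
                    next(f)  # skip the header row
                    for line in f:
                        fields = line.split()
                        if fields[3] == "01":  # 01 = TCP_ESTABLISHED
                            conns.add((parse_addr(fields[1]),
                                       parse_addr(fields[2])))
            except IOError:
                continue  # process exited between glob() and open()
        return conns

    # Anything that opens and closes between two samples is missed.
    print(sample_connections())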

For apps — specifically, microservices — I'm thinking that every app could be
modified to emit pairs [from, to] to statsd, which can then be used to
transfer the data to a central collector. The downside is that every RPC
request has to do this, in all the languages your microservices are written
in.
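
The per-request cost can at least be tiny. A sketch of the emitting side, using statsd's plain UDP counter format (the service names are made up):

    import socket

    STATSD_ADDR = ("127.0.0.1", 8125)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def record_call(from_service, to_service):
        # Encode the [from, to] pair in the metric name; "name:1|c" is
        # statsd's counter format, sent fire-and-forget over UDP.
        metric = "rpc.edges.%s.%s" % (from_service, to_service)
        sock.sendto(("%s:1|c" % metric).encode(), STATSD_ADDR)

    # Called from the RPC client wrapper before each outgoing request.
    record_call("frontend", "billing")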

