Feeding Graph databases – a third use-case for modern log management platforms

lmeyerov · on Dec 31, 2015

Awesome article! This is the exact kind of use case we've been helping enterprises with at Graphistry, especially for SIEMs and operations data. Worth adding two aspects we've been finding important in our journey here:

* We found the need to play nice with Neo4j as well other more common systems here like Kafka/HDFS/Spark, Titan, and Splunk

* It helps to be able to work with big event graphs, where we'll often want to do something like filter for the day's 1M+ priority 10 alerts and see how they connect. The result is we spend a lot of time on our GPU frontend+backend so you can spot patterns in all of the days big events, and exploratory tooling so you can drill down rather than write queries.

If relevant, happy to share an API key (info@graphistry.com) or get on Skype!

david_p · on Dec 30, 2015

Since this seems to be trending: any questions about Linkurious or graph databases are welcome. Linkurious CTO here :)

rspeer · on Dec 31, 2015

I abandoned Neo4j -- and graph databases in general -- years ago because there was no reasonable way to load in several million edges that were not already in a graph database.

Does Neo4j have a better importing story now? I see a blog post from 2014 [1] that makes importing merely a million edges between 500 nodes sound like it's still a terribly difficult operation, giving me the impression that graph databases aren't quite ready for <s>big</s> medium data yet.

[1] http://jexp.de/blog/2014/06/load-csv-into-neo4j-quickly-and-...

If, for example, I wanted to load an N-Triples file that's approximately the size and shape of DBPedia, can I reasonably do so? What tools should I use to get the job done quickly without descending into a Java nightmare?

lmeyerov · on Dec 31, 2015

We settled on "medium"-batching into Titan, and sharing your experience, I'm hoping the Datastax acquisition means ingest will improve.

For the bulk of our work, we do what you'd expect -- load terabytes into HDFS, (Py)Spark for straight SQL and some join helper functions, and occasionally, add some GraphX scala libraries.

I'm curious -- what did you end up doing, & for what?

rspeer · on Dec 31, 2015

I maintain ConceptNet [1]. It's a large-ish, messy semantic graph linking together crowd-sourced knowledge (from Open Mind Common Sense, Wiktionary, some Games with a Purpose, etc.) with expert-created resources (WordNet, OpenCyc, etc.)

[1] http://conceptnet5.media.mit.edu

It's turned out to be a good input for machine learning about semantics, which has changed the goals of its representation a bit -- not only do I need to be able to load in data easily, I also need to be able to iterate over all of it. But some graph operations would be nice to have, too.

Many technical people I describe the project to immediately ask me what graph database I'm using, both before and after the ill-fated semester of grad school where I actually tried to use graph databases.

The answer to what I use now is: a bit of SQLite and some flat files. No need for HDFS, it still fits easily on a hard disk.

schwarzmx · on Dec 31, 2015

Have you tried Stardog [1]? Stardog can handle billions of triples and the upload process is pretty painless.

Disclaimer: I'm one of the developers of Stardog.

[1] http://stardog.com/

rspeer · on Dec 31, 2015

It needs to be free, open-source, and available for commercial use. I'm not the only person who ever builds ConceptNet, and my company is not the only company who ever builds ConceptNet. It would be unreasonable to ask downstream users to get a Stardog license.

Stardog doesn't even show me a price for putting a billion edges into it, just an e-mail link marked "INQUIRE", so I have to assume it would be very, very expensive.

schwarzmx · on Dec 31, 2015

I guess that answers my question.

FWIW Developer/Enterprise versions are free to try and Community doesn't expire.

rspeer · on Dec 31, 2015

Free to try, expensive to succeed.

Thanks for the offer, but I'd only go with that model if there were no other options, and right now you're competing with SQLite and a filesystem, which have no additional costs.

nicolewhite · on Dec 31, 2015

I understand your pain around Neo4j's past import capabilities; when I found it around two years ago I had a lot of trouble with this as well. But with the new neo4j-import tool [0] that's no longer an issue. I've imported ~50 million nodes and ~100 million relationships into Neo4j in under four minutes with neo4j-import.

[0] http://neo4j.com/docs/stable/import-tool-examples.html

rspeer · on Dec 31, 2015

Thanks, that sounds useful. I'll look into that when the site is back up.

david_p · on Dec 31, 2015

To my knowledge, Neo4j has been working on performances a lot. Neo4j v2.2+ does a much better job with large-scale graphs.

The use-case you describe (a couple million nodes + edges) sounds like a fairly reasonable task for Neo4j.

If Neo4j has been a bad experience for you, I also recommend looking into TitanDB v1+, which scales horizontally (backed by Cassandra), although the query language (Gremlin) is not as easy to learn as Cypher.

amirouche · on Dec 31, 2015

not sure about Cypher being easier than Gremlin.

lennartkoopmann · on Dec 30, 2015

... and Graylog CTO here. Ask us anything! :)

alanpost · on Dec 31, 2015

A question for both of you. I don't understand how, if you have it together enough to centralize your logging, you need to be told about interdependencies between your services. What am I missing? Do I underestimate how easy it is to set up centralized logging? Or how complex a deployment becomes before you even wonder if you need centralized logging?

I have a centralized logging system, but I can't imagine being so confused about it that I need my logs to tell me that two components interact with each other.

What don't I understand?

lennartkoopmann · on Dec 31, 2015

It is fairly easy to set up centralized logging. Interdependencies between services can be pretty complex. Think about it from the operations and not the development side: Which firewalls are between the public internet and service X? Why are two Windows workstations talking to each other? Which services are talking to an internal API?

The moment more than a few people are involved with your systems, it can get so complex that visualizing dependencies can be extremely helpful and bring a lot of insights.

alanpost · on Dec 31, 2015

Your appeal to think about it operationally helped me understand. I have to answer questions like the ones you pose whenever something odd pops up in the logs. Thank you.

lmeyerov · on Dec 31, 2015

As a concrete example: we work with enterprises with people numbering anywhere from 10K to 500K to government-scale, and each person may have a desktop/laptop/phone, and all the servers/printers/switches those connect to, and at the logical layer, all the applications and services for making it useful. We'll see multiple central logging systems, hierarchies of administrators, and the results of mergers, acquisitions, and one-off or zombie projects. These organizations are getting sophisticated enough to log 10M, 1B, etc. alerts a day (ex: using graylog or splunk), so we need to focus on the next step of being able to point to one alert and asking what's happening around it.

It's a really fascinating data problem, so we've been loving building tools for seeing into it!

henrikjohansen · on Dec 31, 2015

Indeed it is. Ingesting 100k events per second into one or more centralised log management platforms will not be efficient it your're relying on a row-based analyses approach.

henrikjohansen · on Dec 31, 2015

Image having to do that for 30k machines, 8-9k access points across 100+ different locations accessing hundreds of different systems - it does not work efficiently with visualising the dependencies automagically.

atombender · on Dec 31, 2015

Tracking flow information about networks and apps on Linux is something I have been thinking about. I'm wondering how other people are doing it.

For networking in general, I suppose you could sample /proc/PID/fd and /proc/PID/net/tcp regular intervals, though it would technically miss some connections.

For apps — specifically, microservices — I'm thinking that every app could be modified to emit pairs [from, to] to statsd, which can then be used to transfer the data to a central collector. The downside is that every RPC request has to do this, in all the languages your microservices are written in.