
Amazon Neptune – Fast, reliable graph database built for the cloud - irs
https://aws.amazon.com/neptune
======
lmeyerov
Awesome surprise to see the embargo lifted -- sounds like I can now say the
Graphistry team will be doing a follow-up talk at Amazon Re:Invent tomorrow
(Thursday) on Amazon Neptune + Graphistry. We've been incorporating this into
visual investigation workflows for security, fraud, health records, etc.
They've been doing cool bits on the managed graph layer, and were early to
graph GPU tech (Blazegraph team members), and our side starts bringing that
kind of thinking to visual GPU analytics & workflow automation tech.

If you're in town and into this stuff, ping me at leo [at] graphistry, and
would love to catch up Th/F for coffee+drinks. Also here + email, of course!

------
kendallgclark
If you actually read the docs, it's not Janus-based, it's based on BlazeGraph,
which Amazon reportedly acquihired last year.

~~~
jnwatson
Is that public information? I don't see any press releases about it.

~~~
kendallgclark
There was no PR. But there are traces, like Amazon acquired the domains, etc.
Many former Blazegraph engineers are now Amazon Neptune engineers according to
LinkedIn, etc. It was rumored widely in the graph db world fwiw.

~~~
nicolastorzec
To add on Kendall's comment.

Amazon owns the BLAZEGRAPH trademark:
[https://www.trademarkia.com/blazegraph-86498414.html](https://www.trademarkia.com/blazegraph-86498414.html)

Blazegraph's CEO is currently at Amazon as Principal Product Manager:
[https://www.linkedin.com/in/bradley-bebee-a15764b/](https://www.linkedin.com/in/bradley-bebee-a15764b/)

------
rdslw
Yet another amazon service to lock you in.

And then after two years, when you're no longer a startup with a 100 USD bill, but
a bigger company, you're completely tied to a jungle of Amazon products, and your
exit strategy is very, very costly.

clever amazon, clever.

~~~
derefr
There is some truth to this, but in a larger sense (on an ecosystem level,
rather than from the perspective of an individual company), I can only be
happy when AWS enters a new space. It makes that component into table-stakes
in the IaaS game, which means every other big player is about to step up with
their own offering as well, and the third-party SaaS and open-source
self-hosted offerings in the same space are all going to heat up as well.

Consider the evolution of container hosting services: first we had PaaSes like
Heroku with proprietary container formats; then we got Docker, but Docker
Swarm was nascent and there was no serious Docker Swarm IaaS-cloud offering.
But then, very quickly, AWS built ECS; Google responded with Kubernetes; and
then Kubernetes became the open standard, made everyone forget about Docker
Swarm, and took over (and is even replacing ECS now.)

_That's_ what happens when AWS enters a space. And it's great.

~~~
PaulHoule
It supports RDF/SPARQL which gives you migration options to twenty or so
triple stores such as rdflib, Jena, Virtuoso, AllegroGraph, etc.

No lock in at all.
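
To illustrate why the data travels so easily: RDF is just a set of subject-predicate-object triples, and the core of a SPARQL query is a pattern match over them. A minimal Python sketch, assuming made-up triples and hypothetical `match` and `to_ntriples` helpers (this is not the API of Neptune or of any triple store):

```python
# Toy illustration only: the triples and both helpers are made up,
# not the API of Neptune or of any real triple store.
TRIPLES = {
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "worksAt", "acme"),
}


def match(pattern, triples=TRIPLES):
    """Return sorted triples matching (s, p, o); None acts as a wildcard."""
    return sorted(
        t for t in triples
        if all(want is None or want == got for want, got in zip(pattern, t))
    )


def to_ntriples(triples, base="http://example.org/"):
    """Serialize triples as N-Triples-style lines (every term as an IRI)."""
    return "\n".join(
        f"<{base}{s}> <{base}{p}> <{base}{o}> ." for s, p, o in sorted(triples)
    )
```

Calling `match(("alice", None, None))` lists everything stated about `alice`, and `to_ntriples(TRIPLES)` gives a plain-text dump of the kind another store could load.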

------
lolive
Just an off-topic comment: I am the maintainer of a visual query builder for
SPARQL queries. cf [http://datao.net](http://datao.net)

This tool lets you design query patterns from a graph data model, via
drag-and-drop. The tool can then compile the patterns as SPARQL, run them on an
endpoint and format the results as map/forms/tables/graphs/HTML (via
templating)/...

Another service of Datao ([http://search.datao.net](http://search.datao.net))
proposes a search-engine view of those queries so you can type the textual
representation of an object in any public SPARQL endpoint, and the service
will list the queries currently available in Datao that can be applied upon
this object. You can then run these queries with a click, and get the HTML
templating of the query results.

Feel free to have a look at the website, if you find any interest in this
tool. Any feedback is welcome.

PS: Sorry for the poor quality of the videos. I manage this project in my
spare time :)

------
randomor
Only had experience with Cypher, really liked it. It will be interesting to
see how Neo4j responds to this. Regardless of tech specs, the fully-managed
Neptune vs a community version on AWS Marketplace seems to give Neptune unfair
advantage.

~~~
mcphage
> seems to give Neptune unfair advantage

What do you mean by "unfair" here?

------
Graphguy
Is this JanusGraph under the covers? Guessing since Neptune is a nod to Janus.

~~~
igravious
[http://janusgraph.org/](http://janusgraph.org/)

Support for various storage backends:

    
    
       - Apache Cassandra®
       - Apache HBase®
       - Google Cloud Bigtable
       - Oracle BerkeleyDB
    

I don't understand how a _database_ doesn't have its own native store. What
exactly does a graph database actually _do_ if it doesn't manage the data fed
to it? Same is true for CayleyGraph†
[https://github.com/cayleygraph/cayley](https://github.com/cayleygraph/cayley)
and probably others.

†Plays well with multiple backend stores:

    
    
       - KVs: Bolt, LevelDB
       - NoSQL: MongoDB
       - SQL: PostgreSQL, CockroachDB, MySQL
       - In-memory, ephemeral

~~~
rajman187
There are two main paradigms here

1) "native" graph db: Neo4J is an example of this. This takes advantage of
index-free adjacency. Each node knows what other nodes it is connected to and
hence traversals are very fast. The issues you run into are when you try to
scale. Data that fits onto a single machine is fine and you can replicate your
data for fast parallel reads/traversals across disparate regions of a massive
graph. However you no longer have the concept of data sharding and
distributing the graph as index-free adjacencies don't translate across
physical machines. And another drawback is highly connected vertices, you will
expend a tremendous amount of resources deleting or mutating a vertex with,
say, 10^6 edges. But that vertex is probably a bot so you should delete him
anyway.

2) inverted index graphs, non-native graphs, whatever anti-marketing name it
might have. These rely on tables of vertices and other tables of edges.
Indexes make them fast, not as fast for reads but very fast for writes. And
you get distributed databases (Cassandra, for example, a powerful workhorse of
a backend with data sharding and replication factor, etc.). But then you have
yet another index to maintain and the overhead can get expensive. This is
the model adopted by DataStax, who bought Titan DB (hence the public fork to
Janus) and integrated it with some optimisations and enterprise tools
(monitoring etc, solr search engine) to sit on top of Cassandra.

Both now have improved integration with things like Spark. Cypher is probably
faster than Tinkerpop Gremlin especially with the bolt serialisation
introduced in recent versions of neo4j.

So janus is the graph abstraction layer of the second type and so needs
somewhere to save these relationships. It all comes down to use case (and
marketing) to decide what works best for you.
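
A rough way to picture the two paradigms above, as a toy in-memory Python sketch. Everything here is an assumption for illustration (the sample edges, the function names, a sorted list standing in for an index); real engines are far more involved:

```python
from bisect import bisect_left
from collections import defaultdict

# Made-up sample edges (source, destination).
EDGES = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]

# Style 1: each vertex carries its own neighbor list
# (a stand-in for the "index-free adjacency" idea).
adjacency = defaultdict(list)
for src, dst in EDGES:
    adjacency[src].append(dst)


def neighbors_native(v):
    # Find the vertex once; its edges are stored right with it.
    return list(adjacency.get(v, []))


# Style 2: a flat edge table; keeping it sorted stands in for an
# index on (source, destination).
edge_table = sorted(EDGES)


def neighbors_indexed(v):
    # Binary-search the index, then scan the contiguous run for v.
    i = bisect_left(edge_table, (v,))
    out = []
    while i < len(edge_table) and edge_table[i][0] == v:
        out.append(edge_table[i][1])
        i += 1
    return out
```

Both functions return the same neighbors. The difference is that style 1 pays nothing per hop beyond following the list, while style 2 pays an index lookup per hop but keeps the edges in a plain table that is easy to shard.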

~~~
Gulthor
Recommended reads on the native vs non-native topic:

* [https://www.datastax.com/dev/blog/a-letter-regarding-native-...](https://www.datastax.com/dev/blog/a-letter-regarding-native-graph-databases) (tldr; there is no such thing as a native graph database)

* [https://neo4j.com/blog/note-native-graph-databases/](https://neo4j.com/blog/note-native-graph-databases/) (tldr; native graph databases do exist)

Regarding Cypher vs Gremlin: serialization could be a thing but what matters
among other things are efficient query optimizations, algorithm and (physical)
data model. Ultimately, databases are all reading from 1-dimensional spaces
(RAM or disk), either randomly or (best) sequentially. If you can colocate
vertices with their respective edges, you're fine: this is trivial for graphs
with no edges or graphs that form a linear chain. If not, then things start to
become fun, especially in a distributed way. This will impact performance; the
language, not so much.

~~~
rajman187
I'm familiar with Marko and his arguments hence my quotes around "native" ;)
But it sounds fantastic for marketing

------
brianbreslin
Can someone explain to me in layman's terms what a graph database is?

~~~
mcphage
It's a database that's designed to store relationships between objects instead
of just facts. It has efficient methods of following long chains of
associations. So think of how you store tree structures in a relational
database—there are a lot of different ways of doing it, and they're all
frustrating. Storing trees is something graph databases do naturally.
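
For a concrete feel, here is one of those relational approaches: an adjacency-list table where each row points at its parent, queried with a recursive CTE. This is a sketch using Python's built-in `sqlite3`; the `node` table, its columns, and the sample rows are made up for illustration:

```python
import sqlite3

# In-memory database with a hypothetical adjacency-list table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (id TEXT PRIMARY KEY, parent_id TEXT)")
conn.executemany(
    "INSERT INTO node VALUES (?, ?)",
    [("root", None), ("a", "root"), ("b", "root"), ("a1", "a"), ("a2", "a")],
)


def descendants(node_id):
    """All nodes below node_id, found with a recursive common table expression."""
    rows = conn.execute(
        """
        WITH RECURSIVE sub(id) AS (
            SELECT id FROM node WHERE parent_id = ?
            UNION ALL
            SELECT node.id FROM node JOIN sub ON node.parent_id = sub.id
        )
        SELECT id FROM sub ORDER BY id
        """,
        (node_id,),
    ).fetchall()
    return [r[0] for r in rows]
```

It works, but the "follow the chain" logic has to be spelled out as a self-join that repeats until nothing new turns up, which is the part a graph database handles for you.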

~~~
michaelbuckbee
Trying to get this straight in my mind here.

Is it fair to say that traditional RDBMS/SQL are for storing different "sets"
of related information (tables for products, users, orders).

Graph databases are for storing data about the _same_ set of data as it
interrelates to itself.

- a User and all their Friends (who are also users)
- a Keyword and all associated Terms (which are also keywords)

Is that right?

~~~
InverseFalcon
I think you're concentrating on the wrong thing, here.

Just as RDBMS can have tables about different things (Products, Users,
Orders), graph databases can use labels on nodes for different things (so you
can have :Product nodes, :User nodes, :Order nodes). Though with graph
databases, there is often less rigidity in the associated data than in RDBMS,
as there is no requirement for explicit schema for properties on nodes of
different types in a graph db (plus you can multi-label nodes).

The real differentiator is how relationships are modeled, and how they're
traversed in queries.

With RDBMS/SQL you're going to be working with data in tables, and use join
tables as the relationships between them. You're likely going to need to be
explicit about what is being joined together, so the relationship chain is
likely to be very rigid.

With graph databases, relationships and relationship traversal are used in
place of join tables and table joins, which gives much more flexibility over
how to traverse. You can certainly do friend-of-friend-of-friend queries much
more easily, but you can also perform variable-length traversals using custom
logic for which nodes are in the path and which relationships are traversed
(type, direction, and count), and that can be very well-defined, or very
loosely defined, or a mix, as needed. I don't believe there are good ways to
do that kind of ad-hoc table joining in RDBMS.

As an example of very loosely defined traversals in queries, you can ask for a
shortest path between two nodes, knowing nothing about the nodes or
relationships that could be between them, and get a path back showing the
connecting nodes, with the relationships between the nodes providing context.
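
The shortest-path case can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how any particular graph database implements it; the edges, labels, and the `shortest_path` helper are all hypothetical:

```python
from collections import deque

# Hypothetical edges: (from node, relationship type, to node).
EDGES = [
    ("alice", "FRIEND", "bob"),
    ("bob", "BOUGHT", "book"),
    ("carol", "BOUGHT", "book"),
    ("carol", "WORKS_AT", "acme"),
]


def shortest_path(start, goal, edges=EDGES):
    """Breadth-first search that ignores direction and relationship type.

    Returns a list alternating nodes and relationship types,
    or None if the goal is unreachable.
    """
    graph = {}
    for src, rel, dst in edges:
        graph.setdefault(src, []).append((rel, dst))
        graph.setdefault(dst, []).append((rel, src))
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for rel, nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [rel, nxt])
    return None
```

Asking for `shortest_path("alice", "acme")` needs no knowledge of which node or relationship types lie in between, and the returned path carries the relationship types as context.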

------
chatmasta
It seems a lot of Amazon services are managed instances of open source
applications. For example, commenters are suggesting this may be based on
Janus. Elastic load balancers, at least originally, were likely based on
haproxy. Etc etc.

Has anyone ever considered the licensing implications of this? How is amazon
able to convert an open source product into a proprietary one and then charge
for access to it?

Of course you can argue they’re charging for the infrastructure management,
not the software itself. But that argument quickly breaks down as Amazon
introduces new software, under new names, with a proprietary management
interface over an open source core. Try to find the source code; you can’t.

And if you accept the premise that they’re just charging for hosting, then it
leads to the question of why an open source project doesn’t reap any benefits
from that hosting, or at the very least, from the management interface on top
of it.

It seems like a better solution would be something akin to AWS marketplace,
where open source projects are available to be hosted, and the maintainers can
see some revenue from them.

It seems like unfair rent seeking behavior that amazon is able to slap a
management interface on open source software and then charge for it under the
guise of “hosting.”

~~~
eitland
> How is amazon able to convert an open source product into a proprietary one
> and then charge for access to it?

Totally no problem with liberal licensed open source software.

This is also the intended behaviour of such licenses.

Also many of those big bad commercial companies contribute back big time to a
number of projects. Why? I guess sometimes because devs want to and also
because it makes sense business wise so they don’t have to maintain the code
themselves.

~~~
chatmasta
Depends on the license, doesn’t it? I’m not a licensing expert, but my
understanding is GPLv2 / copy left licensing means that if you create a
derivative product, you need to open source the new code along with the
dependencies.

Seems like a management interface is a clear cut derivative product. Where’s
the source code?

Or perhaps amazon _does_ consider licensing and only builds on top of, eg
Apache licensed projects?

~~~
eitland
Actually GPL allows you to keep your source code as long as you don’t ship the
software and only allows users to use it over the network. (The full truth is
a bit more nuanced.)

The newer AGPL closes this loophole.

And yes: except for Linux and the GNU tools I guess most companies stick with
Apache, BSD, Eclipse and MIT licensed software.

------
abalone
So I get that this offers a simpler paradigm for graph data, but how should we
interpret the "fast & scalable" claim? Is it...

a) Slower than RDBMS/NoSQL but still pretty respectable, so it's a good choice
for things like offline analysis.

b) About the same as RDBMS/NoSQL, so you could use it to handle production
traffic if you want.

c) Faster, so you should definitely prefer it in production, e.g. for fetching
upvotes and comments on posts.

------
Varcht
Why "Neptune"? Having a hard time riddling that name out.

~~~
alexbilbie
Two other well known graph databases are "Janus" and "Titan" both of which are
named after ancient gods

~~~
dbenhur
Janus is a fork of Titan, BTW. The core Titan devs got acquired by Datastax
and redirected to their graphDB offering. Titan stagnated, then got forked as
Janus under the Linux Foundation.

------
joak
Are they using X1? [https://aws.amazon.com/ec2/instance-types/x1/](https://aws.amazon.com/ec2/instance-types/x1/)

For efficient graph DBs it's better to have a lot of ram and cores ...

~~~
lolive
Or they choose a horizontally-scalable architecture, a la TitanDB.

Btw, does anyone know how such solutions handle cross-machine traversals? Are they
schema-based? So the DB knows how to manage data locality and efficient
joins/traversals?

~~~
anonetal
I don't know about Neptune -- curious to hear what it is based on -- but
TitanDB never really supported cross-machine traversals for the execution
engine. The data was stored in a distributed fashion (across say a Cassandra
cluster), but any instance of the execution engine was single-machine, with no
easy way to talk between multiple instances of the execution engine.

~~~
luisdbosquez
One database service that supports horizontally scaled graphs is Azure
CosmosDB Graph API: [https://docs.microsoft.com/en-us/azure/cosmos-db/graph-intro...](https://docs.microsoft.com/en-us/azure/cosmos-db/graph-introduction)

Worth taking a look if you need a managed Gremlin solution with some degree
of global distribution.

------
nicklasss
Super excited about this!!!!! BUT The preview link
([https://pages.aws.com/NeptunePreview.html](https://pages.aws.com/NeptunePreview.html))
is broken, can anyone at AWS team help us with that?

~~~
beebs_aws
[https://pages.awscloud.com/NeptunePreview.html](https://pages.awscloud.com/NeptunePreview.html)

~~~
xvf22
"Your storage cost will be $0.10 per GB-mon0h," on
[https://aws.amazon.com/neptune/pricing/](https://aws.amazon.com/neptune/pricing/)

------
lolive
I really hope Amazon will propose a facility to retrieve the RDFS data model
of an endpoint in a uniform way.

------
hmm_really
What inferencing does it offer to RDF?

How would I bolt on an inference engine to this if none is offered, i.e. to
provide OWL:RL?

------
arthursilva
It could be a modified JanusGraph frontend backed by DynamoDB.

------
alexchamberlain
For wider context here, is this leading the pack or do other public clouds
have competing products already?

------
Dryken
Sadly they use Gremlin, which is so often said to have poor performance

~~~
makmanalp
AFAIK gremlin is just a query language - it shouldn't have much to do with
performance.

~~~
rajman187
Gremlin is indeed the query language but requires a gremlin engine. This is
generally passing strings to the DB (which gives you advantages like predicate
pushdown, essentially DB-side filtering) but there is associated overhead
compared with something like Cypher that is now serialised and very fast with
the Bolt protocol

~~~
Dryken
That was my point, but my rhetoric was not as good as yours :)

------
bdcravens
Interesting that it doesn’t support GraphQL, but rather Gremlin and SPARQL.
Surely that will impact adoption.

~~~
exogen
I love GraphQL and use it quite a bit, but the "graph" part of it is a bit of
a misnomer given all the existing graph database and query technologies. It
doesn't really offer anything in terms of interacting with RDF triples or
making complex graph queries. It has no relational algebra semantics or
ability to query relationships between arbitrary nodes, which is what folks
using graph databases typically want.

(I didn't downvote you though, it's a common misconception.)

~~~
db3d
> It doesn't really offer anything in terms of interacting with RDF triples or
> making complex graph queries.

SPARQL is supported.

~~~
exogen
Yes, I was talking about GraphQL.

