

Introduction to Behavioral Databases - benbjohnson
http://blog.skylandlabs.com/introduction-to-behavioral-databases/

======
pbnjay
First part is a nice little intro, good job!

But I'm sorry, the pitch that follows doesn't make sense. If you want to sell
us on your solution, you have to lay out accurately the problem you're
solving. Instead you list a bunch of common DB implementation issues and
expect us to conclude that your DB is great at processing behavioral data.

First, you talk about transactions being slow, sure. But that's for updating
data, which happens much less frequently than inserts or queries. Second, you
talk about "spatial locality" as if none of the big RDBMSs can do clustered
data, prefetching, or caching. (Does your new DB do these?) Third, you
describe how you can do aggregation 45 million times per second on a single
core.... woo fast addition. Finally, you will require the use of another query
language to do anything.

How will your implementation compare to existing systems? Why will yours
specifically be faster/better? How will applications built on it perform
better than, say, applications written on top of Neo4j?

- I hope this doesn't come across as too snarky, but I am trying to help you
out by being honest.

~~~
benbjohnson
I appreciate the feedback. I don't think you're being snarky at all (except
maybe the fast addition comment). :)

The post was meant to give a short, somewhat technical overview of the
technology. It's a hard balance between too much information and too little.
I'll try to elaborate at length on your points in future posts.

But here are a few short answers:

1. Transactions - These are used for updates, inserts and deletes in an
RDBMS. Queries can also rely on MVCC, which incurs performance overhead and
fragmentation. Sky is transaction-free (since events are independent). MVCC
may be added in the future, but it is not currently planned and would not be
enabled by default.

2. Spatial Locality - Yes, some databases allow for clustering (e.g. Oracle
hash clusters, or manual clustering in Postgres), but Sky does it by default.
Caching is built in, but prefetching hasn't been added yet.

3. Aggregation - The benchmark was just meant to give a baseline for data
access. I don't think anyone will be impressed by the addition part. The idea
was to show that not much gets in the way (locks, latches, etc.) when
traversing data. More complex queries will obviously run slower. That being
said, computing a directed graph (or a funnel analysis, which is similar) is
a very common use case for viewing behavior.

4. Querying - The query language is essential because it means aggregation
takes place where the data is. Transferring the data to a set of Hadoop
servers running Java code kills performance, because millions of entries have
to be sent over the network (not to mention the bandwidth required). The
domain-specific language also has natural consequences, such as making
real-time queries possible, along with some other cool features.

5. Comparison - I'll provide some comparison benchmarks in the future. Sky is
meant to excel in performance and ease of analysis. For the specific use case
of behavioral data, Sky should be significantly faster and easier to use than
what's currently available. That includes Neo4j. However, that's comparing
apples to oranges. Sky is built around analyzing time series data and Neo4j is
built around graph data. They're two fundamentally different types of data
access. Sky would fall flat on its face if you tried to use it as an OLTP or
graph database.
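
To make points 2 and 3 a little more concrete, here is a minimal Python
sketch of a funnel count and a directed-graph (transition) count over events
that are grouped per user and kept in time order. It is an illustration under
assumptions only: the `events_by_user` layout and the `funnel` and
`transitions` helpers are hypothetical names, not Sky's storage format or
query language.

```python
from collections import Counter

# Hypothetical layout: each user's events sit together, sorted by timestamp,
# so one user's whole history can be read in a single sequential pass.
events_by_user = {
    "u1": [(1, "home"), (2, "pricing"), (3, "signup")],
    "u2": [(1, "home"), (4, "pricing")],
    "u3": [(2, "pricing"), (5, "home")],
}


def funnel(events_by_user, steps):
    """Count how many users reached each step of `steps`, in order."""
    reached = [0] * len(steps)
    for timeline in events_by_user.values():
        next_step = 0
        for _, action in timeline:  # already in time order
            if next_step < len(steps) and action == steps[next_step]:
                reached[next_step] += 1
                next_step += 1
    return reached


def transitions(events_by_user):
    """Count action -> next-action edges, i.e. a directed behavior graph."""
    counts = Counter()
    for timeline in events_by_user.values():
        for (_, current), (_, following) in zip(timeline, timeline[1:]):
            counts[(current, following)] += 1
    return counts


funnel_counts = funnel(events_by_user, ["home", "pricing", "signup"])
edge_counts = transitions(events_by_user)
```

Because each timeline is independent, neither function needs a lock or a
join, and both could run on whichever node holds the data so that only the
small result has to cross the network.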

The blog post isn't meant to talk badly about existing database technologies.
I was an Oracle DBA for years and I have a huge appreciation for what
relational databases can do. But most databases are meant to be
general-purpose. I saw that many people used tools like Hadoop to process log
files because a general-purpose database can't handle the load, and out of
that I saw a common use case that deserved its own database.

Let me know if I can address any other questions.

