
SenseiDB: Open-source, distributed, realtime, semi-structured database - thomas11
http://senseidb.com/
======
untog
I can't speak for SenseiDB itself, but I think that LinkedIn deserves a lot of
credit for the technology they have been open sourcing in the last few months.
Fantastic to see.

------
agentultra
Looks like the marketing copy is off from the actual implementation, AFAICT.

    Sensei (先生) means teacher or professor in Japanese (http://en.wikipedia.org/wiki/Sensei).

    It shares the same pronunciation and writing with the Chinese word that has the same meaning. This name indicates that the system can be used in place of Oracle database in many applications.

Okay, so on the name alone I can replace my Oracle database! Great!

Seriously though it goes on...

They claim the database is ACID:
<http://javasoze.github.com/sensei/data-guarantee.html>

But they built the entire thing around "eventual consistency."

And statements like:

    "Sensei provides a high-level of durability by maintaining N replicas of each shard to guarantee a level of availability and fault-tolerance"

Don't seem to make sense when talking about ACID given that a _write_
operation will happen _at some point_. Looks like the data event producers
will shard the data across N replicas without quorum... so there's no
guarantee that there will be N replicas available... is that right (and that
the transaction won't be lost mid-stream either)?

Skimming through the source it doesn't seem to be doing anything terribly
revolutionary... and I can see the usefulness of the trade-offs they made in
this database for certain scenarios. However I don't think the claims of ACID
guarantees and "real time" are particularly representative of what this DB
will actually do. They just don't seem to jibe with "eventual consistency"
models.

I'm not a hardcore database guru though so maybe I'm missing something?

~~~
joshhart
Where does that page claim perfect ACID semantics? It's meant to describe what
Sensei gives you for each aspect of ACID.

Sensei needs an event stream to process. We've open-sourced and apache-fied
Kafka which is a great candidate for an event stream. For Atomicity and
Isolation, the event stream must provide these guarantees.

Consistency is handled with a routing parameter. Requests partitioned around
an id will always go to the same searcher, so they won't go backwards in the
stream except in failure scenarios. This is eventually consistent, but tries
to keep things sane.
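
A minimal sketch of the routing idea described above, not Sensei's actual code: the function name, the replica names, and the use of CRC32 as the hash are all assumptions made for illustration. The point is only that a stable hash of the routing id sends repeated requests to the same searcher as long as the replica list is unchanged.

```python
import zlib


def pick_replica(routing_id, replicas):
    """Deterministically map a routing id to one replica (hypothetical helper)."""
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    index = zlib.crc32(routing_id.encode("utf-8")) % len(replicas)
    return replicas[index]


# Hypothetical searcher names for one partition.
replicas = ["searcher-a", "searcher-b", "searcher-c"]

first = pick_replica("member-42", replicas)
second = pick_replica("member-42", replicas)

# The same id lands on the same searcher while the replica list is stable.
print(first == second)
```

If a searcher drops out of the list, the mapping can change, which corresponds to the "except in failure scenarios" caveat.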

Durability: The event stream helps with this. We don't immediately flush while
indexing in Lucene, so if there's a crash we can replay the persistent event
stream.
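
A rough sketch of that recovery path, under stated assumptions: the event stream is modelled as an in-memory list, and the `Indexer` class, its fields, and its methods are hypothetical names rather than Sensei or Lucene APIs. Unflushed documents are lost on a crash, and replaying the stream from the last committed offset restores them.

```python
class Indexer:
    """Toy indexer that buffers documents in memory and flushes lazily."""

    def __init__(self, log):
        self.log = log        # persistent, ordered event stream (a list here)
        self.disk = {}        # documents that have been flushed
        self.pending = {}     # documents indexed in memory but not yet flushed
        self.committed = 0    # number of events whose effects are on disk
        self.position = 0     # number of events consumed so far

    def consume(self, count):
        end = min(self.position + count, len(self.log))
        for doc_id, doc in self.log[self.position:end]:
            self.pending[doc_id] = doc
        self.position = end

    def flush(self):
        self.disk.update(self.pending)
        self.pending.clear()
        self.committed = self.position

    def crash(self):
        # Everything not flushed is gone; only disk and committed survive.
        self.pending.clear()
        self.position = self.committed

    def recover(self):
        # Replay the stream from the last committed offset, then flush.
        self.consume(len(self.log) - self.position)
        self.flush()


log = [(1, "a"), (2, "b"), (3, "c")]
idx = Indexer(log)
idx.consume(2)
idx.flush()          # events 1 and 2 are durable
idx.consume(1)       # event 3 is only in memory
idx.crash()
idx.recover()
print(idx.disk)
```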

Does this make sense? SenseiDB is not intended for purely transactional
processing. For some applications, sensei would make a good candidate for
replacing your DB. For others, not so much.

~~~
agentultra
_Where does that page claim perfect ACID semantics? It's meant to describe
what Sensei gives you for each aspect of ACID._

And they're all well and good features! I can tell SenseiDB isn't
transactional. Like I said, I skimmed the source and understand at a high
level what it does. I could see it being very useful in certain conditions
as LinkedIn currently does and I'm sure others will.

However, I think the copy is confusing (at least it was for me). On the
guarantees page there's an "ACID-ity" headline. For each aspect of ACID, as
you say, the page describes what Sensei offers. The confusing part was that I
was mentally comparing each aspect against what I understand to be the common
semantics of ACID. I think it would be clearer if there were some
distinction under the main headline that acknowledges this difference.

hth!

------
joshhart
Looks like someone non-LinkedIn posted this so the rest of the team is
unlikely to be around for a few more hours. Let me know if you have any
questions! I worked mostly on the distribution & rpc piece.

~~~
threepointone
Congratulations on the project, and compliments on the breadth of
documentation. Nicely done.

(Nitpick - none of the urls change. Some js error where you're not doing a
pushState? You should report this.)

edit - this works: <http://senseidb.github.com/sensei/index.html>

~~~
joshhart
Thanks! I'll make sure that gets corrected.

------
opendomain
3 separate NoSQL stores from ONE company? And their primary business is not
databases? There are currently more than 100 different NoSQL solutions - I
believe we will see a consolidation of NoSQL in 2012.

~~~
efnx
On the flip side, we might see a lot of "build your own NoSQL store" tutorials
in 2012.

------
andres
SenseiDB looks really impressive. We were actually thinking about the same
problem when we built ThriftDB (<http://www.thriftdb.com>) for Octopart so
it's especially interesting to see LinkedIn's implementation.

Looking through the documentation a couple of things come to mind:

* Does SenseiDB support nested data structures?

* What happens when you modify the schema? Can you add/delete/modify attributes?

~~~
javasoze
Re: nested data structures: they are on the roadmap, but there is no target
date yet. There are 2 ways of implementing this: 1) prepend field names
reflecting the nesting, much like elasticsearch; 2) encode payloads in the
term list indicating the nested nature.

1) is straightforward, but can be restrictive if you want to in-nest faceting.
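
For illustration only, option 1 might look like the sketch below. The `flatten` helper and the dot separator are assumptions, not anything taken from Sensei or its roadmap; the idea is simply that each nested field becomes a flat field whose name encodes its path.

```python
def flatten(doc, prefix=""):
    """Flatten nested dicts into one level, prefixing keys with their path."""
    flat = {}
    for key, value in doc.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse, carrying the path so far as the new prefix.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


doc = {"name": "resistor", "specs": {"ohms": 220, "package": {"type": "0805"}}}
flat = flatten(doc)
print(flat)
# {'name': 'resistor', 'specs.ohms': 220, 'specs.package.type': '0805'}
```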

We are still debating on the route to take.

As for schema changes, we should document that better: certain schema changes
simply require you to bounce the node, but if there are data integrity
changes, you would need to reindex. We will work on more documentation on the
specifics.

~~~
ericmoritz
If I change the faceting does bouncing the nodes cause a re-index
automatically or is it more involved?

~~~
javasoze
Bouncing the node will start reindexing from the last committed version on
disk. If the index is up to date, no re-indexing will be triggered.

------
hello_moto
How many home-grown databases has LinkedIn developed in the past few years?

~~~
strlen
How many programming languages do you use? :-) Just like we use Python, Java,
Scala, etc... where each is appropriate, we also use different databases where
appropriate.

Each system serves a need. We also use off the shelf databases, where
appropriate: it should be noted Sensei uses Lucene (a well known search
library), Voldemort has a pluggable storage engine (where we mostly use
BerkeleyDB and a custom read-only storage engine), and Espresso uses MySQL.

We don't build databases for reasons of NIH: we focus on building features
(faceting, real-time indexing, partitioning, fail over, etc...) that enable us
to build fast, usable, feature rich, scalable, and reliable applications. We
readily use open source components in many places within both our
infrastructure systems (search, various databases, Kafka) and in the
applications, and contribute to existing open source projects.

------
javajosh
Options are great (and thanks for contributing to the OS community), but the
biggest barrier to adoption will be confusion about all the NoSQL solutions
floating around right now (including two others from LinkedIn). Therefore, I
recommend that you spend more time on the homepage comparing Sensei with other
NoSQL stores. Perhaps take a two-tier approach for those coming from Oracle or
MySQL: why NoSQL, and if NoSQL, why Sensei NoSQL?

------
franklovecchio
Looking at its config, I'm reminded of Cassandra 0.6, but then I see the
ZooKeeper functionality. Interesting. Love the open-sourcing! I showed an
example of how to distribute inter-mingled DBs using MQTT a bit ago, but never
posted here:
[http://franklovecchio.tumblr.com/post/13051814890/mqdb-distr...](http://franklovecchio.tumblr.com/post/13051814890/mqdb-distributing-databases-with-mqtt).

One thing is for sure -- NoSQL isn't so mysterious anymore :)

------
ericmoritz
So to get data into this database, I have to have a Gateway implementation in
Java?

Would I have to publish messages to Kafka in my language of choice to avoid
writing any Java?

~~~
javasoze
There is a Kafka gateway packaged with Sensei, see sensei-gateways module.

We love Kafka!

------
zrail
From a quick glance at the docs there's no database-side aggregations. Is that
correct?

~~~
joshhart
The facets let you support counting and group by - is there something else you
had in mind?

Edit: You can get a pretty quick idea of what's supported in the query
language at <http://senseidb.github.com/sensei/bql.html>. Since Sensei was
designed with document and text use cases in mind, we have plans to support
advanced queries where a relevance model is part of the query, allowing you to
perform a custom sort on the server.

~~~
zrail
sum() and distinct() come to mind for sure, but I don't know how easy they'd
be to implement. The particular use case I have in mind is where we have
events streaming in and we usually don't care about the actual events, just
various sums, counts and distinct counts over them grouped by various
dimensions.
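
To make the use case concrete, here is a small plain-Python sketch of that kind of rollup. It is not BQL and not something Sensei is known to provide; the `aggregate` function, the event fields, and the sample data are all made up for illustration.

```python
from collections import defaultdict


def aggregate(events, dimension, measure, distinct_field):
    """Per-group count, sum of `measure`, and distinct count of `distinct_field`."""
    groups = defaultdict(lambda: {"count": 0, "sum": 0, "seen": set()})
    for event in events:
        group = groups[event[dimension]]
        group["count"] += 1
        group["sum"] += event[measure]
        group["seen"].add(event[distinct_field])
    # Replace the sets with their sizes so only the aggregates are returned.
    return {
        key: {"count": g["count"], "sum": g["sum"], "distinct": len(g["seen"])}
        for key, g in groups.items()
    }


events = [
    {"country": "US", "user": "u1", "bytes": 100},
    {"country": "US", "user": "u2", "bytes": 50},
    {"country": "US", "user": "u1", "bytes": 25},
    {"country": "DE", "user": "u3", "bytes": 10},
]
result = aggregate(events, "country", "bytes", "user")
print(result["US"])  # {'count': 3, 'sum': 175, 'distinct': 2}
```

Exact distinct counts need a set per group, so memory grows with cardinality; that is one reason they may be harder to support server-side than plain counts.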

------
dougk
Can Sensei be used as a replacement for MySQL + Lucene?

~~~
LukeHoersten
Yeah, it'd be nice to see a side-by-side against other common DBs like MySQL
and Oracle.

~~~
Ecio78
There's a comparison vs. MySQL here:
<http://senseidb.github.com/sensei/performance.html> for insert and query
performance.

------
foobarbazetc
This is awesome.

Thanks for sharing this with the world. :)

------
cultureulterior
Is it single-point-of-failure free?

~~~
joshhart
Yes. It's a partitioned system - you just need to have enough replicas per
partition. Clients track cluster state by registering listeners to ZooKeeper.

------
ruslansv
A truly amazing piece of technology

------
jamesu
Nice looking wheel. I have a feeling it could be rounder though.

