Show HN: SummitDB – In-Memory NoSQL DB

qwertyuiop924 · on Oct 12, 2016

Can somebody get us an on disk, small, fast, in-process, low-footprint NoSQL DB? Thus far, I've been pressing SQLite into use for a lot of monotyped data that would have been better served by a NoSQL DB, but I couldn't find one suitable.

Apparently, in-memory DBs are in now. They're fast and all, that's nice, but I'm still running on a machine with 8 GB of RAM, and when the dataset starts to exceed that, I want an option other than "grab some more DDR3." I remember the fate of TinyMUD: on-disk operation must at least be an option for any DB I'd use for datasets of a non-fixed size.

preetamjinka · on Oct 12, 2016

There are several key-value stores you can choose from. RocksDB, LevelDB, lmdb, Berkeley DB, Tokyo Cabinet, sophia, and others. What about those? You can always build abstractions on top.

lopatin · on Oct 12, 2016

BoltDB if you'd like to embed it into a Go program.

chaotic-good · on Oct 12, 2016

BoltDB can be corrupted on a crash (kill -9 or hard reset); at least this was the case when I gave it a try. Any database can be, but in this case it was easy to achieve.

mbertschler · on Oct 12, 2016

Did you file a bug for this? I thought it was pretty resilient in such cases.

chaotic-good · on Oct 12, 2016

I've found several submitted issues that described this problem. All these issues are closed now, so maybe I was wrong about Bolt.

qwertyuiop924 · on Oct 12, 2016

Thanks for the suggestions. Some of those might work. The only problem with k/v stores is the serialization/deserialization cost, as most of them can only store strings.

erichocean · on Oct 12, 2016

> The only problem with k/v stores is the serialization/deserialization cost, as most of them can only store strings.

I store Cap'n Proto objects directly into LMDB, and can read them without any memory allocation or deserialization cost. Freaking rocks.

ddorian43 · on Oct 12, 2016

lmdb has some special cases where you can store capnproto/flat-buffers objects and read without copying/deserializing!

qwertyuiop924 · on Oct 12, 2016

Really? That could be useful...

ddorian43 · on Oct 12, 2016

LMDB can hand you back a pointer to your object in the memory map, and we can easily use that pointer with Cap'n Proto without any calls to malloc(). source: https://news.ycombinator.com/item?id=12394385
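For anyone who wants to see the idea without installing anything, here is a rough sketch of reading a fixed-layout record straight out of a memory map. It is not LMDB or Cap'n Proto: it uses Python's standard `mmap` and `struct` modules as a stand-in, and the record layout (`RECORD`) and helper names are made up for illustration. Python still creates the returned int and float objects, so this only shows that no intermediate copy of the record bytes is needed.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-layout record: little-endian 64-bit id + 64-bit float score.
RECORD = struct.Struct("<qd")


def write_records(path, records):
    """Append each (id, score) pair to the file in its packed binary form."""
    with open(path, "wb") as f:
        for rec_id, score in records:
            f.write(RECORD.pack(rec_id, score))


def read_record(view, index):
    """Decode record `index` in place from the mapped buffer.

    unpack_from reads the fields at an offset inside the buffer, so the
    record bytes are never copied into a separate bytes object first.
    """
    return RECORD.unpack_from(view, index * RECORD.size)


path = os.path.join(tempfile.mkdtemp(), "records.bin")
write_records(path, [(1, 0.5), (2, 1.5), (3, 2.5)])

with open(path, "rb") as f:
    # Map the whole file read-only; pages are faulted in by the OS on access.
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mapped)
    second = read_record(view, 1)
    # Release the view before closing the map, otherwise close() raises BufferError.
    view.release()
    mapped.close()

print(second)
```

Because the OS pages the file in on demand, the same pattern works when the file is larger than RAM, which is the property being discussed in this subthread.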

chaotic-good · on Oct 12, 2016

You can try LMDB - https://symas.com/products/lightning-memory-mapped-database/ In my opinion there are exactly two production-ready embedded databases: SQLite and LMDB. At least these two databases have decent crash recovery mechanisms (I've tested both). I haven't tested all of them; maybe ForestDB and Berkeley DB are good as well.

qwertyuiop924 · on Oct 12, 2016

The only question is this: Is the dataset required to fit into RAM? Because if it is, then I haven't fixed the problem, now have I...

erichocean · on Oct 12, 2016

> The only question is this: Is the dataset required to fit into RAM?

I run an LMDB dataset many times the size of RAM, works great. Most data isn't recent, so the virtual memory cache works extremely well.

qwertyuiop924 · on Oct 12, 2016

Okay. Good.

_pgmf · on Oct 12, 2016

http://engine.so/ -- quite a nice list of small, fast, in-process databases.

erichocean · on Oct 12, 2016

> Can somebody get us an on disk, small, fast, in-process, low-footprint NoSQL DB?

That's LMDB, as others have noted. I have used it in production for years, never once lost data. Insanely fast, low footprint, crash-proof.

(P.S. I have seen SQLite get corrupted regularly on mobile devices, which is why I no longer use it there.)

qwertyuiop924 · on Oct 12, 2016

Bizarre. SQLite is known for being hard to corrupt. How did it happen?

erichocean · on Oct 12, 2016

No idea, but with about 50,000 regular users hitting the local SQLite DB on their mobile devices hard, I'd get a corrupted DB report a few times a week.

It's possible my SQLite configuration wasn't as crash-proof as the default install. I moved onto another project before I could investigate further.

not_kurt_godel · on Oct 13, 2016

Seems possible you were just seeing the rate at which people's storage hardware gets corrupted rather than an issue with SQLite.

bruno2223 · on Oct 12, 2016

I use LevelDB, it is a key-value database, from Google.

I am using it with Nodejs, on t2.micro, handling 10+ million requests / day

Never had an issue. In production for one year already. Really good :-)

tracker1 · on Oct 12, 2016

There's also a lot of higher level abstractions over leveldb at this point.

qwertyuiop924 · on Oct 12, 2016

Level could work, but it depends on how expensive serialization is.

brightball · on Oct 12, 2016

Mnesia built into Elixir/Erlang potentially?

alixaxel · on Oct 12, 2016

Have you heard about TieDot?

maerF0x0 · on Oct 12, 2016

make a gigantic swap file?

SteveNuts · on Oct 11, 2016

Someone should invent a package manager for databases, the same way we have npm, composer, etc.

it's getting difficult to manage all my different databases.

arielweisberg · on Oct 11, 2016

Totally! I can just sudo apt-get install low-latency-strong-consistency-multi-dc-with-declarative-query-language-db as a virtual package and get the right one for my distro.

manigandham · on Oct 12, 2016

Great options already exist:

MemSQL for blazing fast distributed full SQL database with cross-datacenter replication and in-memory rowstore + disk-based columnstore.

ScyllaDB for Cassandra rewritten in C++ for blazing fast dynamo-style multi-master multi-datacenter disk-based wide-column database.

I'll also throw in AMPS (by 60East) as the best messaging platform that supports innovative SQL and real-time state-of-the-world queries on its message streams (instead of using that rabbitmq or kafka crap).

akbar501 · on Oct 12, 2016

https://github.com/Netflix/dynomite

Dynomite for production proven, limitless scale for Redis.

Gives Redis the ability to handle both in-memory and on-disk workloads.

Dynamo-inspired linearly scalable, shared nothing multi-master architecture. Pairs well with Cassandra.

Supports pluggable backends (ex. RocksDB).

Been in production 2+ years. One of the larger clusters handles over 3.6 million sustained ops/sec in production, every day.

ddorian43 · on Oct 12, 2016

Redis is only the API/driver, not the features, right? Meaning it's just CRUD and not the many features/data types that Redis provides, correct?

akbar501 · on Oct 12, 2016

It's the full Redis API and protocol. See https://github.com/Netflix/dynomite/blob/dev/notes/redis.md for a list of all supported Redis features.

The benefit of Dynomite's support for the Redis API + protocol is a.) you can use any Redis client and b.) the same code can be used for standalone Redis on your laptop and on a distributed Dynomite cluster.

manigandham · on Oct 12, 2016

This is neat, but why use this over just using Cassandra-based DBs?

Also you were a speaker at Data Layer right?

akbar501 · on Oct 12, 2016

Yes, I was a speaker at DataLayer.

There are two answers to your first question.

1. Dynomite pairs well with Cassandra. At scale, Cassandra is not a speed demon when it comes to reads. Dynomite helps to improve Cassandra's read performance.

2. The use case for Dynomite as the primary database is for workloads that require high throughput and low latency. In other words, Dynomite delivers consistently lower latency at any scale.

manigandham · on Oct 21, 2016

Better to just use scylladb which natively handles it rather than run another layer of software.

ddorian43 · on Oct 12, 2016

MemSQL + AMPS have "contact us" pricing unfortunately. Do you know of any open-source sharded queue (like Amazon SQS or IronMQ)?

AliaksandrH · on Oct 12, 2016

Some companies are closed source for a really good reason: it allows for Vertical Integration and Focused Execution. Bay Area companies spend millions of dollars' worth of engineers' salaries on projects that layer complexity upon complexity (and you need dozens of engineers to maintain them), just because they are in love with the idea of open source. This is a dangerous proposition, and it stifles innovation. There is an opportunity to dramatically simplify a lot of data pipelines. Check this out: http://docs.memsql.com/docs/pipelines-overview

ddorian43 · on Oct 12, 2016

I understand that but not everyone lives in SF or creates a microservice for each function they have hosted in an overpriced ec2-vm inside a subpar docker-container using json-serialization.

You can have (open-source or free) and develop without a community (I think Chrome does that).

The question is: is it worth it? Without VC money it's a little hard AND time-consuming.

manigandham · on Oct 12, 2016

Yes, those are both closed source but great companies that will work with you. Happy to put you in touch if you want. MemSQL also has a completely free community edition.

For open-source fast and clean messaging, I'd recommend NATS, they recently built NATS streaming that builds on top of the pub/sub to include kafka like persistence - http://nats.io/

ddorian43 · on Oct 12, 2016

Yes but memsql community doesn't have high availability (voltdb does the same thing).

I don't think nats can be used like I said by looking at the docs.

manigandham · on Oct 12, 2016

It has clustering: https://nats.io/documentation/server/gnatsd-cluster/

What are you looking to do? Sharding for what? Data size? Throughput?

nickpsecurity · on Oct 11, 2016

Is that the command for installing Google's F1 RDBMS? It has those traits. I really wish they'd spin it off as a product since the only startup offering something similar got acquired and deep sixed by Apple.

atombender · on Oct 12, 2016

Take a look at CockroachDB, which had its first public beta earlier this year [1]. It's directly inspired by Google's F1 and Spanner projects.

It's similar to FoundationDB (the Apple product you're referring to) in that it's an SQL database layered on top of a K/V store. It's based on some clever technology to accomplish distributed transactions, strong consistency and high availability, and is looking very promising.

[1] https://www.cockroachlabs.com/

nickpsecurity · on Oct 12, 2016

Appreciate the tip. I'm aware of it. They had a recent post on HN where they were just getting around to dealing with stability issues. I was really excited till I saw that. I'm holding off for a while till it matures a bit.

ddorian43 · on Oct 12, 2016

I think they just went too far with full-sql-joins-indexes-column-families in the first 1.0 version. Better to build it little by little.

carterehsmith · on Oct 12, 2016

"First public beta" of a database product? No.

arielweisberg · on Oct 11, 2016

I don't think it meets the low latency requirement. It's a CAP theorem joke. Can't have strong consistency across multiple data centers with low latency because you have to read/write from a quorum of DCs in order to maintain availability in case one DC goes down.
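To put rough numbers on that argument, here is a small sketch. The round-trip times are invented for illustration, and `majority` and `quorum_latency_ms` are hypothetical helper names. The point is only that a quorum operation finishes when the slowest member of the fastest majority has answered, so the latency is set by how far apart the data centers are.

```python
def majority(n):
    """Smallest number of replicas that forms a strict majority of n."""
    return n // 2 + 1


def quorum_latency_ms(rtts_ms, quorum_size):
    """Latency of a quorum operation given per-replica round-trip times.

    The coordinator waits for `quorum_size` acknowledgements, so the
    operation completes when the quorum_size-th fastest replica replies.
    """
    return sorted(rtts_ms)[quorum_size - 1]


# Hypothetical round trips from the coordinator to three data centers:
# its own DC, a nearby DC, and one on another continent.
rtts = [1, 12, 140]
q = majority(len(rtts))

# All three DCs up: the two closest replicas form the quorum.
latency = quorum_latency_ms(rtts, q)

# Nearby DC down: the quorum must now include the distant replica.
degraded = quorum_latency_ms([1, 140], q)

print(q, latency, degraded)
```

With these made-up numbers a healthy cluster answers in about 12 ms, but losing the nearby data center pushes every strongly consistent operation out to the cross-continent round trip.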

nickpsecurity · on Oct 12, 2016

Oh yeah. Low latency would be saying too much. I think its worst case for strong consistency is around 30 seconds or so.

Anyway, the closest to the low-latency part would be NonStop or OpenVMS clusters across redundant, leased lines. They're fast, scale well at the HW level, use ACID databases, and are highly available. Commonly used in transaction processing. One VMS cluster survived the 9/11 attack without losing a single transaction. NonStop does up to five 9's. I'd be interested in the CAP analysis of these older methods.

arielweisberg · on Oct 12, 2016

It also depends on your definition of data center. It's really multi-region that is the problem since then it's the speed of light you have to deal with.

Quorum across some data centers is fast enough that it's not hard to manage in new applications.

nickpsecurity · on Oct 12, 2016

That makes a lot of sense. Many of the NonStop and VMS clusters were in the same country but far enough from each other to isolate against many disasters. Could be why they did better on the issue.

mirekrusin · on Oct 12, 2016

You can always use quantum entanglement to help a (qu)bit.

biokoda · on Oct 12, 2016

http://www.actordb.com/

tshannon · on Oct 11, 2016

Competition's good, imo. We have an "embarrassment of riches", as it were, when it comes to databases.

m1sta_ · on Oct 11, 2016

And yet nothing that meets all my requirements. Still so much room to improve!

fiatjaf · on Oct 12, 2016

This is exactly like what I wanted Redis to be. Secondary indexes solve all the problems.

However, I wanted a feature that allowed custom indexes with Javascript, like CouchDB map functions.

tidwall · on Oct 12, 2016

This is available with SummitDB. The SETINDEX command has an EVAL option that allows for custom JavaScript indexes.
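For readers unfamiliar with the concept, here is a rough Python sketch of what a custom secondary index amounts to: a key/value store plus a sorted structure ordered by a user-supplied function. This is not SummitDB's implementation and does not use its SETINDEX syntax; `IndexedStore` and its methods are hypothetical names for illustration.

```python
import bisect
import json


class IndexedStore:
    """Toy key/value store with one secondary index (illustrative only)."""

    def __init__(self, index_key):
        self._data = {}              # primary storage: key -> raw value
        self._index = []             # sorted list of (index_value, key)
        self._index_key = index_key  # user-supplied function: raw value -> sortable

    def set(self, key, value):
        # If the key already exists, drop its old index entry first.
        if key in self._data:
            old = (self._index_key(self._data[key]), key)
            del self._index[bisect.bisect_left(self._index, old)]
        self._data[key] = value
        bisect.insort(self._index, (self._index_key(value), key))

    def range(self, lo, hi):
        """Keys whose index value v satisfies lo <= v <= hi, in index order."""
        # (lo,) sorts before every (lo, key), so this finds the first v >= lo.
        start = bisect.bisect_left(self._index, (lo,))
        keys = []
        for value, key in self._index[start:]:
            if value > hi:
                break
            keys.append(key)
        return keys


# Index JSON documents by their "age" field, similar in spirit to a map function.
store = IndexedStore(lambda raw: json.loads(raw)["age"])
store.set("user:1", '{"name": "Tom", "age": 47}')
store.set("user:2", '{"name": "Janet", "age": 38}')
store.set("user:3", '{"name": "Carol", "age": 52}')
store.set("user:2", '{"name": "Janet", "age": 60}')  # update moves the index entry

result = store.range(40, 55)
print(result)
```

A real engine would keep the index in a B-tree and persist it, but the shape is the same: every write updates the index, and range queries walk it in order.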

dvirsky · on Oct 12, 2016

I'm working on a secondary index implementation in redis using modules, that should be out in a few days. It indexes redis hashes similar to how you index tables.

ddorian43 · on Oct 12, 2016

Check out redis-search module which I think does that (only for keys for now) https://github.com/RedisLabsModules/RediSearch/commit/86eb1c...

dvirsky · on Oct 12, 2016

I wrote that module as well :)

The secondary index module will be more focused on automating more traditional indexing of numbers and simple string values. It has an SQL-ish language for WHERE expressions, and then the result of that is piped to any redis command you want it to.

If you're interested, here's a draft of the syntax (it's changed a bit since but you'll get the idea). https://gist.github.com/dvirsky/3ef73143a6d8212f2b50096a8eb6...

ddorian43 · on Oct 12, 2016

Oh, I should've seen the usernames. I'm not interested in the secondary indexing, mostly in the full-text search. It could grow into something big (compared to the benchmarks on redislabs https://redislabs.com/blog/adding-a-search-engine-to-redis-a...)

dvirsky · on Oct 12, 2016

Yeah, that blog post is also me :) hehe. Anyway glad to hear you find it interesting. If you want to help out ping me. There's tons of stuff on the roadmap, and currently I'm almost entirely focused on the secondary index.

ddorian43 · on Oct 12, 2016

I still see some issues with it:

1. Redis being single-threaded, where inverted indexes are ~easily sharable by cores (I think, based on what I've read). This makes hotspots easier (since it's a single core, not a single machine).

2. Sharing of data (terms) between multiple indexes on 1 server (like in Elasticsearch you use _routing, but all things are in 1 index, though some people like separate-per-user, like Dropbox does).

3. Cluster not fully nice yet (losing writes).

4. No option to merge from different nodes (or even a Redis API to do so, as far as I know).

dvirsky · on Oct 12, 2016

Salvatore just added support for asynchronous operations in side threads in modules. This would make merging results from the cluster possible, and I want to get to it soon. I'm now implementing it on the client side and it works great, but I want it to be as transparent as in Elastic.

The recent additions would also allow more parallelization of the indexes for reads. I could create RW mutexes in the engine and allow multiple clients to work on the same term index. It won't be trivial, but it's not super hard either.

dvirsky · on Oct 12, 2016

Anyway, if you want to continue the discussion further, we can take it to reddit (I haven't posted the benchmarks there yet, it's a good idea anyway), or just email me, dvir at redis labs.

skrebbel · on Oct 12, 2016

I can't find from the docs at all whether this persists to disk.

If not, what is the use case? Why would I need all those ACID-y guarantees if my server can fail at any time and all data is gone?

pasta · on Oct 12, 2016

In-Memory could be used for processing data. Or as temp tables.

And by the way: a lot of existing databases support this.

tidwall · on Oct 12, 2016

SummitDB does persist to disk. Shutting down or kill -9 the server will not lose data.

deforciant · on Oct 12, 2016

Looks great, I really like BuntDB, already using it in a project :) Glad to see this new SummitDB, will definitely try it out!

ddorian43 · on Oct 12, 2016

Looks like it has no sharding unfortunately. Do you have any info/ETA/idea on this, OP?

tidwall · on Oct 12, 2016

Yes, it is something that I certainly want soon, but there's no ETA at the moment. I haven't yet fully fleshed out the strategy for sharding the key space.