Show HN: SummitDB – In-Memory NoSQL DB (github.com/tidwall)
81 points by tidwall on Oct 11, 2016 | 70 comments



Can somebody get us an on disk, small, fast, in-process, low-footprint NoSQL DB? Thus far, I've been pressing SQLite into use for a lot of monotyped data that would have been better served by a NoSQL DB, but I couldn't find one suitable.

Apparently, in-memory DBs are in now. They're fast and all, that's nice, but I'm still running on a machine with 8 Gigs of ram, and when the dataset starts to exceed that, I want an option other than "grab some more DDR3." I remember the fate of TinyMUD: on-disk operation must at least be an option for any DB I'd use for datasets of a non-fixed size.


There are several key-value stores you can choose from. RocksDB, LevelDB, lmdb, Berkeley DB, Tokyo Cabinet, sophia, and others. What about those? You can always build abstractions on top.
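
As a rough illustration of "building abstractions on top", here is a minimal sketch of a typed layer over a bytes-only store. It uses Python's standard-library dbm.dumb purely as a stand-in for an on-disk key-value store (it is not one of the engines listed above), and the TypedStore class and the record layout are hypothetical names made up for this example.

```python
import dbm.dumb
import os
import struct
import tempfile

# Hypothetical fixed record layout: user id (uint32), score (float64).
RECORD = struct.Struct("<Id")


class TypedStore:
    """Thin typed wrapper over a bytes-only key-value store."""

    def __init__(self, path):
        # dbm.dumb is only a stand-in for a real embedded k/v engine.
        self._db = dbm.dumb.open(path, "c")

    def put(self, key, user_id, score):
        # Encode the typed fields into the bytes the store expects.
        self._db[key.encode("utf-8")] = RECORD.pack(user_id, score)

    def get(self, key):
        # Decode the stored bytes back into (user_id, score).
        return RECORD.unpack(self._db[key.encode("utf-8")])

    def close(self):
        self._db.close()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example")
        store = TypedStore(path)
        store.put("alice", 7, 99.5)
        store.close()
        # Reopen to show the record was written to disk.
        store = TypedStore(path)
        record = store.get("alice")
        store.close()
        return record
```

A fixed binary layout like this keeps the encode/decode step cheap compared with a text format, at the cost of flexibility.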


BoltDB if you'd like to embed it into a Go program.


BoltDB can be corrupted on a crash (kill -9 or hard reset); at least this was the case when I gave it a try. Any database can be, but in this case it was easy to achieve.


did you file a bug for this? I thought it was pretty resilient in such cases


I've found several submitted issues that described this problem. All these issues are closed now, so maybe I was wrong about Bolt.


Thanks for the suggestions. Some of those might work. The only problem with k/v stores is the serialization/deserialization cost, as most of them can only store strings.


> The only problem with k/v stores is the serialization/deserialization cost, as most of them can only store strings.

I store Cap'n Proto objects directly into LMDB, and can read them without any memory allocation or deserialization cost. Freaking rocks.


lmdb has some special cases where you can store capnproto/flat-buffers objects and read without copying/deserializing!


Really? that could be useful...


LMDB can hand you back a pointer to your object in the memory map, and we can easily use that pointer with Cap'n Proto without any calls to malloc(). source: https://news.ycombinator.com/item?id=12394385
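
To make the idea concrete, here is a minimal sketch of reading one field straight out of a memory-mapped file. It is not LMDB or Cap'n Proto code: it uses Python's standard-library mmap and struct as stand-ins, and the record layout and function names are assumptions made up for illustration. Python still allocates the resulting float object, so this only shows the access pattern (decode in place at an offset, no copy of the whole record), not a truly allocation-free read.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed layout: id (uint32), timestamp (uint32), value (float64).
RECORD = struct.Struct("<IId")
VALUE_OFFSET = 8  # byte offset of the float64 field inside one record


def write_records(path, records):
    with open(path, "wb") as f:
        for rec in records:
            f.write(RECORD.pack(*rec))


def read_value(path, index):
    """Read only the float64 field of record `index` from the memory map."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # unpack_from decodes in place at the given offset; the record
            # is never copied out of the map as a separate bytes object.
            offset = index * RECORD.size + VALUE_OFFSET
            (value,) = struct.unpack_from("<d", m, offset)
            return value


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "records.bin")
        write_records(path, [(1, 100, 1.5), (2, 200, 2.5), (3, 300, 3.5)])
        return read_value(path, 1)
```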


You can try LMDB - https://symas.com/products/lightning-memory-mapped-database/ In my opinion there are exactly two production-ready embedded databases: SQLite and LMDB. At least these two databases have decent crash recovery mechanisms (I've tested both). I haven't tested all of them; maybe ForestDB and Berkeley DB are good as well.


The only question is this: Is the dataset required to fit into RAM? Because if it is, then I haven't fixed the problem, now have I...


> The only question is this: Is the dataset required to fit into RAM?

I run an LMDB dataset many times the size of RAM, works great. Most data isn't recent, so the virtual memory cache works extremely well.


Okay. Good.


http://engine.so/ -- quite a nice list of small, fast, in-process databases.


> Can somebody get us an on disk, small, fast, in-process, low-footprint NoSQL DB?

That's LMDB, as others have noted. I have used it in production for years, never once lost data. Insanely fast, low footprint, crash-proof.

(P.s. I have seen SQLite get corrupted regularly on mobile devices, which is why I no longer use it there.)


Bizarre. SQLite is known for being hard to corrupt. How did it happen?


No idea, but with about 50,000 regular users hitting the local SQLite DB on their mobile devices hard, I'd get a corrupted DB report a few times a week.

It's possible my SQLite configuration wasn't as crash-proof as the default install. I moved onto another project before I could investigate further.


Seems possible you were just seeing the rate at which people's storage hardware gets corrupted rather than an issue with SQLite.


I use LevelDB, it is a key-value database, from Google.

I am using it with Nodejs, on t2.micro, handling 10+ million requests / day

Never had an issue. In production for one year already. Really good :-)


There's also a lot of higher level abstractions over leveldb at this point.


Level could work, but it depends on how expensive serialization is.


Mnesia built into Elixir/Erlang potentially?


Have you heard about TieDot?


make a gigantic swap file?


Someone should invent a package manager for databases, the same way we have npm, composer, etc.

it's getting difficult to manage all my different databases.


Totally! I can just sudo apt-get install low-latency-strong-consistency-multi-dc-with-declarative-query-language-db as a virtual package and get the right one for my distro.


Great options already exist:

MemSQL for blazing fast distributed full SQL database with cross-datacenter replication and in-memory rowstore + disk-based columnstore.

ScyllaDB for Cassandra rewritten in C++ for blazing fast dynamo-style multi-master multi-datacenter disk-based wide-column database.

I'll also throw in AMPS (by 60East) as the best messaging platform that supports innovative SQL and real-time state-of-the-world queries on its message streams (instead of using that rabbitmq or kafka crap).


https://github.com/Netflix/dynomite

Dynomite for production proven, limitless scale for Redis.

Gives Redis the ability to handle both in-memory and on-disk workloads.

Dynamo-inspired linearly scalable, shared nothing multi-master architecture. Pairs well with Cassandra.

Supports pluggable backends (ex. RocksDB).

Been in production 2+ years. One of the larger clusters handles over 3.6 million sustained ops/sec in production, every day.


Redis is only the API/driver, not the features, right? Meaning it's just CRUD and not the many features/data types that Redis provides, correct?


It's the full Redis API and protocol. See https://github.com/Netflix/dynomite/blob/dev/notes/redis.md for a list of all supported Redis features.

The benefit of Dynomite's support for the Redis API + protocol is a.) you can use any Redis client and b.) the same code can be used for standalone Redis on your laptop and on a distributed Dynomite cluster.


This is neat, but why use this over just using Cassandra-based DBs?

Also you were a speaker at Data Layer right?


Yes, I was a speaker at DataLayer.

There are two answers to your first question.

1. Dynomite pairs well with Cassandra. At scale, Cassandra is not a speed demon when it comes to reads. Dynomite helps to improve Cassandra's read performance.

2. The use case for Dynomite as the primary database is for workloads that require high throughput and low latency. In other words, Dynomite delivers consistently lower latency at any scale.


Better to just use scylladb which natively handles it rather than run another layer of software.


MemSQL and AMPS have "contact us" pricing, unfortunately. Do you know of any open-source sharded queue (like Amazon SQS or IronMQ)?


Some companies are closed source for a really good reason: it allows for Vertical Integration and Focused Execution. Bay Area companies spend millions of dollars' worth of engineers' salaries on projects that layer complexity upon complexity (and you need dozens of engineers to maintain them), just because they are in love with the idea of open source. This is a dangerous proposition, and it stifles innovation. There is an opportunity to dramatically simplify a lot of data pipelines. Check this out: http://docs.memsql.com/docs/pipelines-overview


I understand that but not everyone lives in SF or creates a microservice for each function they have hosted in an overpriced ec2-vm inside a subpar docker-container using json-serialization.

You can be open source (or free) and still develop without a community (I think Chrome does that).

The question is: is it worth it ? Without vc-money-cash it's a little hard AND time-consuming.


Yes, those are both closed source but great companies that will work with you. Happy to put you in touch if you want. MemSQL also has a completely free community edition.

For open-source fast and clean messaging, I'd recommend NATS, they recently built NATS streaming that builds on top of the pub/sub to include kafka like persistence - http://nats.io/


Yes but memsql community doesn't have high availability (voltdb does the same thing).

I don't think nats can be used like I said by looking at the docs.


It has clustering: https://nats.io/documentation/server/gnatsd-cluster/

What are you looking to do? Sharding for what? Data size? Throughput?


Is that the command for installing Google's F1 RDBMS? It has those traits. I really wish they'd spin it off as a product since the only startup offering something similar got acquired and deep sixed by Apple.


Take a look at CockroachDB, which had its first public beta earlier this year [1]. It's directly inspired by Google's F1 and Spanner projects.

It's similar to FoundationDB (the Apple product you're referring to) in that it's an SQL database layered on top of a K/V store. It's based on some clever technology to accomplish distributed transactions, strong consistency and high availability, and is looking very promising.

[1] https://www.cockroachlabs.com/


Appreciate the tip. I'm aware of it. They had a recent post on HN where they were just getting around to dealing with stability issues. I was really excited till I saw that. I'm holding off for a while till it matures a bit.


I think they just went too far with full-sql-joins-indexes-column-families in the first 1.0 version. Better to build it little by little.


"First public beta" of a database product? No.


I don't think it meets the low latency requirement. It's a CAP theorem joke. Can't have strong consistency across multiple data centers with low latency because you have to read/write from a quorum of DCs in order to maintain availability in case one DC goes down.
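
A back-of-the-envelope sketch of that argument: a quorum operation finishes only when a majority of replicas has answered, so its latency is set by the round trip to the majority-th fastest replica. The function name and the round-trip numbers below are hypothetical, and the model ignores retries, queuing, and multi-round protocols.

```python
def quorum_latency(rtts_ms):
    """Latency until a majority of replicas has acknowledged.

    rtts_ms: round-trip time in milliseconds from the coordinator to each
    replica, including the coordinator's local replica.
    """
    majority = len(rtts_ms) // 2 + 1
    # The operation completes when the majority-th fastest reply arrives.
    return sorted(rtts_ms)[majority - 1]


# Hypothetical round-trip times, for illustration only.
same_metro = [1, 2, 3]        # three data centers in one region
cross_region = [1, 70, 150]   # three data centers on different continents

metro_ms = quorum_latency(same_metro)      # 2
global_ms = quorum_latency(cross_region)   # 70
```

Under these assumed numbers the cross-region quorum pays the 70 ms hop on every strongly consistent operation, while the single-region quorum stays in the low single digits.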


Oh yeah. Low latency would be saying too much. I think its worst case for strong consistency is around 30 seconds or so.

Anyway, the closest to the low-latency part would be NonStop or OpenVMS clusters across redundant, leased lines. They're fast, scale well at the HW level, use ACID databases, and are highly available. Commonly used in transaction processing. One VMS cluster survived the 9/11 attack without losing a single transaction. NonStop does up to five 9's. I'd be interested in the CAP analysis of these older methods.


It also depends on your definition of data center. It's really multi-region that is the problem since then it's the speed of light you have to deal with.

Quorum across some data centers is fast enough that it's not hard to manage in new applications.


That makes a lot of sense. Many of the NonStop and VMS clusters were in the same country but far enough from each other to isolate against many disasters. Could be why they did better on the issue.


You can always use quantum entanglement to help a (qu)bit.



Competition's good, imo. We have an "embarrassment of riches", as it were, when it comes to databases.


And yet nothing that meets all my requirements. Still so much room to improve!


This is exactly like what I wanted Redis to be. Secondary indexes solve all the problems.

However, I wanted a feature that allowed custom indexes with Javascript, like CouchDB map functions.


This is available with SummitDB. The SETINDEX command has an EVAL option that allows for custom JavaScript indexes.
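
For readers unfamiliar with function-defined indexes, here is a minimal sketch of the general idea in Python. It is not SummitDB's implementation or API: the IndexedStore class and its methods are hypothetical, and a plain Python function plays the role that a JavaScript expression or a CouchDB map function would play.

```python
import bisect


class IndexedStore:
    """In-memory key-value store with function-defined secondary indexes."""

    def __init__(self):
        self._data = {}
        self._indexes = {}  # index name -> (key function, sorted entries)

    def create_index(self, name, key_fn):
        # Build the index over any values that already exist.
        entries = sorted((key_fn(v), k) for k, v in self._data.items())
        self._indexes[name] = (key_fn, entries)

    def set(self, key, value):
        if key in self._data:
            self._remove_from_indexes(key, self._data[key])
        self._data[key] = value
        for key_fn, entries in self._indexes.values():
            # Keep each index sorted by (computed index key, primary key).
            bisect.insort(entries, (key_fn(value), key))

    def _remove_from_indexes(self, key, old_value):
        for key_fn, entries in self._indexes.values():
            pos = bisect.bisect_left(entries, (key_fn(old_value), key))
            del entries[pos]

    def iterate(self, name):
        """Yield (key, value) pairs in the order defined by the index."""
        _, entries = self._indexes[name]
        for _, key in entries:
            yield key, self._data[key]


def demo():
    store = IndexedStore()
    store.create_index("by_age", lambda user: user["age"])
    store.set("user:1", {"name": "Tom", "age": 47})
    store.set("user:2", {"name": "Janet", "age": 26})
    store.set("user:3", {"name": "Carol", "age": 35})
    store.set("user:1", {"name": "Tom", "age": 30})  # update re-indexes
    return [key for key, _ in store.iterate("by_age")]
```

The index function must return values that can be compared with each other; a production store would also need to persist the index and handle failures in the user-supplied function.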


I'm working on a secondary index implementation in redis using modules, that should be out in a few days. It indexes redis hashes similar to how you index tables.


Check out redis-search module which I think does that (only for keys for now) https://github.com/RedisLabsModules/RediSearch/commit/86eb1c...


I wrote that module as well :)

The secondary index module will be more focused on automating more traditional indexing of numbers and simple string values. It has an SQL-ish language for WHERE expressions, and then the result of that is piped to any redis command you want it to.

If you're interested, here's a draft of the syntax (it's changed a bit since but you'll get the idea). https://gist.github.com/dvirsky/3ef73143a6d8212f2b50096a8eb6...


Oh, I should've seen the usernames. I'm not interested in the secondary indexing, mostly in the full-text search. It could grow into something big (compared to the benchmarks on redislabs https://redislabs.com/blog/adding-a-search-engine-to-redis-a...)


Yeah, that blog post is also me :) hehe. Anyway glad to hear you find it interesting. If you want to help out ping me. There's tons of stuff on the roadmap, and currently I'm almost entirely focused on the secondary index.


I still see some issues with it:

1. Redis is single-threaded, whereas inverted indexes are fairly easy to share across cores (I think, based on what I've read). This makes hotspots more likely, since the limit is a single core, not a single machine.

2. Sharing of data (terms) between multiple indexes on one server (in Elasticsearch you use _routing and everything lives in one index, though some people prefer a separate index per user, like Dropbox does).

3. Cluster isn't fully there yet (it can lose writes).

4. No option to merge results from different nodes (or even a Redis API to do so, as far as I know).


Salvatore just added support for asynchronous operations in side threads in modules. This would make merging results from the cluster possible, and I want to get to it soon. I'm now implementing it on the client side and it works great, but I want it to be as transparent as in Elasticsearch.

The recent additions would also allow more parallelization of the indexes for reads. I could create RW mutexes in the engine and allow multiple clients to work on the same term index. It won't be trivial, but it's not super hard either.


Anyway, if you want to continue the discussion further, we can take it to reddit (I haven't posted the benchmarks there yet, it's a good idea anyway), or just email me, dvir at redis labs.


I can't find from the docs at all whether this persists to disk.

If not, what is the use case? Why would I need all those ACID-y guarantees if my server can fail at any time and all data is gone?


In-Memory could be used for processing data. Or as temp tables.

And by the way: a lot of existing databases support this.


SummitDB does persist to disk. Shutting down or kill -9 the server will not lose data.


looks great, I really like BuntDB, already using it in project :) Glad to see this new SummitDB, will definitely try it out!


Looks like it has no sharding, unfortunately. Do you have any info/ETA/ideas on this, OP?


Yes, it is something that I certainly want soon, but there's no ETA at the moment. I haven't yet fully fleshed out the strategy for sharding the key space.
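
For context, the simplest way to shard a key space is to hash each key and take the result modulo the number of shards. The sketch below shows only that generic technique; it is not SummitDB's strategy (which, per the comment above, had not been decided), and the shard_for function is a hypothetical name.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard number in [0, num_shards) using a stable hash."""
    # A cryptographic hash is used here only because it is stable across
    # processes and machines, unlike Python's built-in hash() for strings.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(keys, num_shards):
    """Group keys by the shard they would live on."""
    shards = {n: [] for n in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards
```

The known weakness of plain modulo sharding is that changing the shard count remaps most keys, which is why range partitioning or consistent hashing is often preferred when shards are added or removed. It also scatters neighboring keys, which complicates ordered range scans over an index.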



