Can somebody get us an on-disk, small, fast, in-process, low-footprint NoSQL DB? So far, I've been pressing SQLite into service for a lot of monotyped data that would have been better served by a NoSQL DB, but I couldn't find one suitable.
Apparently, in-memory DBs are in now. They're fast and all, which is nice, but I'm still running on a machine with 8 GB of RAM, and when the dataset starts to exceed that, I want an option other than "grab some more DDR3." I remember the fate of TinyMUD: on-disk operation must at least be an option for any DB I'd use for datasets of non-fixed size.
There are several key-value stores you can choose from: RocksDB, LevelDB, LMDB, Berkeley DB, Tokyo Cabinet, Sophia, and others. What about those? You can always build abstractions on top.
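To sketch what such an abstraction might look like: most of these stores only accept bytes, so a thin document layer is mostly serialization glue. A plain dict stands in here for any byte-oriented K/V store (an LMDB transaction, a `dbm` handle, etc.); `DocStore` is a made-up name for illustration.

```python
import json

class DocStore:
    """A thin document-store abstraction over any byte-oriented K/V store."""
    def __init__(self, kv):
        # `kv` can be anything dict-like that maps bytes -> bytes.
        self.kv = kv

    def put(self, key, doc):
        # Most K/V stores only accept bytes, so serialize the document.
        self.kv[key.encode()] = json.dumps(doc).encode()

    def get(self, key):
        raw = self.kv.get(key.encode())
        return None if raw is None else json.loads(raw)

store = DocStore({})
store.put("user:1", {"name": "ada", "karma": 42})
```

The serialization step in `put`/`get` is exactly the cost people complain about with K/V stores, which is where zero-copy designs come in.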
BoltDB can be corrupted by a crash (kill -9 or hard reset); at least, that was the case when I gave it a try. Any database can be, but in that case it was easy to achieve.
Thanks for the suggestions. Some of those might work. The only problem with K/V stores is the serialization/deserialization cost, as most of them can only store strings.
LMDB can hand you back a pointer to your object in the memory map, and we can easily use that pointer with Cap'n Proto without any calls to malloc().
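The same zero-copy idea can be sketched with Python's stdlib `mmap` module. This is only an illustration of reading a fixed-layout record through a memory map without copying the payload, not LMDB's actual API:

```python
import mmap
import struct
import tempfile

# Write a fixed-layout record (little-endian uint32 count + payload),
# then read it back through a memory map without copying the payload.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(struct.pack("<I", 7) + b"payload-bytes")
    path = f.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mm)              # a window into the map, no copy
    (count,) = struct.unpack_from("<I", view, 0)
    payload = view[4:]                 # a slice of the view: still zero-copy
```

A zero-copy serialization format like Cap'n Proto just interprets such a view in place, instead of parsing it into freshly allocated objects.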
source: https://news.ycombinator.com/item?id=12394385
You can try LMDB - https://symas.com/products/lightning-memory-mapped-database/
In my opinion there are exactly two production-ready embedded databases: SQLite and LMDB. At least these two have decent crash-recovery mechanisms (I've tested both). I haven't tested all of them; maybe ForestDB and BerkeleyDB are good as well.
No idea, but with about 50,000 regular users hitting the local SQLite DB on their mobile devices hard, I'd get a corrupted DB report a few times a week.
It's possible my SQLite configuration wasn't as crash-proof as the default install. I moved on to another project before I could investigate further.
Totally! I can just sudo apt-get install low-latency-strong-consistency-multi-dc-with-declarative-query-language-db as a virtual package and get the right one for my distro.
MemSQL for blazing fast distributed full SQL database with cross-datacenter replication and in-memory rowstore + disk-based columnstore.
ScyllaDB for Cassandra rewritten in C++ for blazing fast dynamo-style multi-master multi-datacenter disk-based wide-column database.
I'll also throw in AMPS (by 60East) as the best messaging platform that supports innovative SQL and real-time state-of-the-world queries on its message streams (instead of using that RabbitMQ or Kafka crap).
The benefit of Dynomite's support for the Redis API + protocol is a.) you can use any Redis client and b.) the same code can be used for standalone Redis on your laptop and on a distributed Dynomite cluster.
1. Dynomite pairs well with Cassandra. At scale, Cassandra is not a speed demon when it comes to reads. Dynomite helps improve Cassandra's read performance.
2. The use case for Dynomite as the primary database is for workloads that require high throughput and low latency. In other words, Dynomite delivers consistently lower latency at any scale.
Some companies are closed source for a really good reason: it allows for Vertical Integration and Focused Execution. Bay Area companies spend millions of dollars' worth of engineer salaries on projects that layer complexity upon complexity (and you need dozens of engineers to maintain them), just because they are in love with the idea of open source. This is a dangerous proposition, and it stifles innovation. There is an opportunity to dramatically simplify a lot of data pipelines. Check this out: http://docs.memsql.com/docs/pipelines-overview
I understand that, but not everyone lives in SF or creates a microservice for each function they have, hosted in an overpriced EC2 VM inside a subpar Docker container using JSON serialization.
You can be open source (or free) and still develop without a community (I think Chrome does that).
The question is: is it worth it? Without VC money it's a little hard AND time-consuming.
Yes, those are both closed source but great companies that will work with you. Happy to put you in touch if you want. MemSQL also has a completely free community edition.
For open-source, fast, and clean messaging, I'd recommend NATS; they recently built NATS Streaming, which builds on top of the pub/sub to include Kafka-like persistence - http://nats.io/
Is that the command for installing Google's F1 RDBMS? It has those traits. I really wish they'd spin it off as a product, since the only startup offering something similar got acquired and deep-sixed by Apple.
Take a look at CockroachDB, which had its first public beta earlier this year [1]. It's directly inspired by Google's F1 and Spanner projects.
It's similar to FoundationDB (the Apple acquisition you're referring to) in that it's an SQL database layered on top of a K/V store. It's based on some clever technology to accomplish distributed transactions, strong consistency, and high availability, and is looking very promising.
Appreciate the tip. I'm aware of it. They had a recent post on HN where they were just getting around to dealing with stability issues. I was really excited till I saw that. I'm holding off for a while till it matures a bit.
I don't think it meets the low latency requirement. It's a CAP theorem joke. Can't have strong consistency across multiple data centers with low latency because you have to read/write from a quorum of DCs in order to maintain availability in case one DC goes down.
Oh yeah. Low latency would be saying too much. I think its worst case for strong consistency is around 30 seconds or so.
Anyway, closest to the low-latency part would be NonStop or OpenVMS clusters across redundant, leased lines. They're fast, scale well at the HW level, use ACID databases, and offer high availability. Commonly used in transaction processing. One VMS cluster survived the 9/11 attack without losing a single transaction. NonStop does up to five nines. I'd be interested in a CAP analysis of these older methods.
It also depends on your definition of data center. It's really multi-region that is the problem since then it's the speed of light you have to deal with.
Quorum across some data centers is fast enough that it's not hard to manage in new applications.
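To put rough numbers on that, here's a back-of-envelope model. The RTT figures are made up for illustration:

```python
# A majority quorum of N replicas needs floor(N/2) + 1 acks, so commit
# latency is roughly the round-trip time to the slowest member of the
# fastest quorum, i.e. the q-th smallest RTT.
def quorum_latency_ms(rtts_ms):
    q = len(rtts_ms) // 2 + 1
    return sorted(rtts_ms)[q - 1]

# Hypothetical RTTs from the coordinator: local DC, nearby DC, remote DC.
same_region = quorum_latency_ms([1, 12, 80])           # far DC not needed: 12 ms
cross_region = quorum_latency_ms([1, 70, 75, 80, 85])  # 5 DCs, quorum of 3: 75 ms
```

With nearby DCs the far replica never sits on the commit path; spread the replicas across regions and the speed of light puts a hard floor under every write.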
That makes a lot of sense. Many of the NonStop and VMS clusters were in the same country but far enough from each other to isolate against many disasters. That could be why they did better on the issue.
I'm working on a secondary index implementation in redis using modules, that should be out in a few days. It indexes redis hashes similar to how you index tables.
The secondary index module will be more focused on automating more traditional indexing of numbers and simple string values. It has an SQL-ish language for WHERE expressions, and then the result of that is piped to any redis command you want it to.
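The core of that kind of index can be sketched in a few lines of Python. This shows only the general idea of a range-scannable secondary index over numeric fields, not the module's actual implementation; all names here are made up:

```python
import bisect

class NumericIndex:
    """Toy secondary index on one numeric field of stored hashes."""
    def __init__(self):
        self._entries = []   # kept sorted as (value, key) pairs

    def add(self, key, value):
        bisect.insort(self._entries, (value, key))

    def where_between(self, lo, hi):
        # Keys whose indexed value is in [lo, hi] -- a WHERE range scan.
        i = bisect.bisect_left(self._entries, (lo, ""))
        j = bisect.bisect_right(self._entries, (hi, "\uffff"))
        return [k for _, k in self._entries[i:j]]

idx = NumericIndex()
idx.add("user:1", 25)
idx.add("user:2", 31)
idx.add("user:3", 40)
```

The returned keys are what would then be piped into whatever Redis command you want to run over the matching hashes.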
Yeah, that blog post is also me :) hehe. Anyway glad to hear you find it interesting. If you want to help out ping me. There's tons of stuff on the roadmap, and currently I'm almost entirely focused on the secondary index.
I still see some issues with it:
1. Redis being single-threaded, whereas inverted indexes are ~easily shardable across cores (I think, based on what I've read). This makes hotspots more likely (since it's a single core, not a single machine).
2. Sharing of data (terms) between multiple indexes on one server (like in Elasticsearch you use _routing, but everything lives in one index, though some people prefer a separate index per user, like Dropbox does).
3. The cluster isn't fully solid yet (it can lose writes).
4. No option to merge results from different nodes (or even a Redis API to do so, as far as I know).
Salvatore just added support for asynchronous operations in side threads in modules. This would make merging results from across the cluster possible, and I want to get to it soon. For now I'm implementing it on the client side and it works great, but I want it to be as transparent as in Elastic.
The recent additions would also allow more parallelization of the indexes for reads. I could create RW mutexes in the engine and allow multiple clients to work on the same term index. It won't be trivial, but it's not super hard either.
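A readers-writer lock of that kind is a standard construct. Here's a minimal Python sketch of the mechanics (the real thing would be C inside the module; this is just to show the idea of many concurrent readers, one exclusive writer):

```python
import threading

class RWLock:
    """Minimal readers-writer lock: many concurrent readers, one writer."""
    def __init__(self):
        self._readers = 0
        self._count_lock = threading.Lock()  # guards the reader count
        self._writer = threading.Lock()      # held while any access is active

    def acquire_read(self):
        with self._count_lock:
            self._readers += 1
            if self._readers == 1:
                self._writer.acquire()       # first reader blocks writers

    def release_read(self):
        with self._count_lock:
            self._readers -= 1
            if self._readers == 0:
                self._writer.release()       # last reader unblocks writers

    def acquire_write(self):
        self._writer.acquire()               # exclusive: no readers, no writers

    def release_write(self):
        self._writer.release()
```

Note this simple variant can starve writers under a constant stream of readers; a production version would add writer preference.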
Anyway, if you want to continue the discussion further, we can take it to reddit (I haven't posted the benchmarks there yet, it's a good idea anyway), or just email me, dvir at redis labs.
Yes, it's something I certainly want soon, but there's no ETA at the moment. I haven't yet fully fleshed out the strategy for sharding the key space.
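One common strategy for sharding a key space is a consistent-hash ring. Since the actual strategy isn't decided, this is just a generic sketch of that idea; the node names are hypothetical:

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring: each node gets many virtual points on the
    ring, and a key maps to the first node point at or after its hash."""
    def __init__(self, nodes, vnodes=64):
        self.ring = sorted(
            (self._hash(f"{n}:{i}"), n) for n in nodes for i in range(vnodes)
        )
        self._points = [p for p, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        # Walk clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect(self._points, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["redis-1", "redis-2", "redis-3"])
```

The appeal is that adding or removing a node only remaps the keys adjacent to its points, rather than reshuffling the whole key space.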