(For specific examples that I've run into recently: does it have geo-spatial extensions, and does it crash and burn and corrupt all its data if a malloc() fails)
You can turn off durability, but then you risk losing everything.
For features... forget about geo-spatial extensions - they can only do read-committed isolation level... but as they only allow for one SQL statement per transaction I suppose this might not be such a big concern.
If you were using this for something like a session management database, or something where durability is not so important then it's probably fine. Not sure I'd use it for anything that relies heavily on transaction management features.
I see other limitations, like it cannot support CHANGE COLUMN, and it can't do joins on more than 2 tables. Actually, it doesn't look like it can do FULL OUTER or RIGHT OUTER joins, either. :(
Of course, a new product will have certain limitations, which will be removed as the product matures.
MemSQL is fully durable (with the option of sync or async durability, similar to sync and async replication) via its transaction log. If the MemSQL process is killed, it will recover all committed transactions on the next startup. Durability has very little performance cost on throughput, as MemSQL uses only sequential I/O. And of course, durability can be completely turned off for maximum performance.
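The sync/async distinction described above can be sketched as follows. This is a hypothetical illustration, not MemSQL's actual implementation: in sync mode the commit is fsync'd to the sequential log before it is acknowledged, while async mode acknowledges first and lets the OS flush later.

```python
import os
import tempfile

class TxLog:
    """Toy append-only transaction log (illustrative names, not MemSQL's API)."""

    def __init__(self, path, sync=True):
        self.f = open(path, "ab")
        self.sync = sync  # True = synchronous durability

    def commit(self, record: bytes):
        # Appends are sequential I/O, which is why durability is cheap.
        self.f.write(record + b"\n")
        self.f.flush()
        if self.sync:
            # Sync mode: the record is on disk before we acknowledge the commit.
            os.fsync(self.f.fileno())
        # Async mode acknowledges immediately; the OS flushes the page cache
        # later, so a crash can lose the most recent commits.

path = os.path.join(tempfile.mkdtemp(), "tx.log")
log = TxLog(path, sync=True)
log.commit(b"INSERT 1")
```

The trade-off is exactly the one the comment describes: async mode trades a small durability window for lower commit latency.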
1) Facebookers: Check
2) Data Scaling experience: Check
3) In-Memory with SQL semantics: Check
4) New-York based software that can be sold to quant funds: Check
I'm excited. It's downloading now.
The bullshit-bingo lingo on their homepage is mind-numbing.
Meanwhile their actual software seems rather underwhelming, bordering on snake oil.
You can buy something like kdb, but that costs $70k a year and requires that your engineers learn some extremely new semantics if they're only used to SQL and (choose any popular language here, Python to C++).
Did you mean large, fittable-in-memory, and (distributed across multiple computers and/or very reliable)? Because that is much harder.
KEY and INDEX are literally the exact same thing and are just there to support different syntaxes: http://dev.mysql.com/doc/refman/5.1/en/alter-table.html
Perhaps you are referring to primary keys vs secondary indexes, for which there is a difference in how they are physically stored on disk and in memory, particularly with InnoDB. There is a substantial performance gain to be had by having a relevant primary key that is referenced in the where clause as the data is stored in the same page as the primary key, both on disk and in the buffer pool. Secondary indexes reference the primary key, thus require two lookups.
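The two-lookup cost of secondary indexes mentioned above can be modeled with a toy sketch. This is purely illustrative (plain dicts standing in for B-trees): in an InnoDB-style layout, rows are clustered by primary key, and a secondary index stores the primary key rather than a row pointer, so a secondary lookup takes two steps.

```python
# Clustered structure: rows stored keyed by primary key.
clustered = {
    1: {"id": 1, "email": "a@x.com"},
    2: {"id": 2, "email": "b@x.com"},
}

# Secondary index: indexed value -> primary key (not a direct row pointer).
email_idx = {"a@x.com": 1, "b@x.com": 2}

def by_primary_key(pk):
    return clustered[pk]          # one lookup: PK -> row

def by_email(email):
    pk = email_idx[email]         # lookup 1: secondary index -> PK
    return clustered[pk]          # lookup 2: PK -> row

assert by_email("b@x.com")["id"] == 2
```

This is why a relevant primary key in the WHERE clause is the cheaper access path: it skips the first hop entirely.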
Anyway, why is there not a single benchmark available to support the sales-pitch?
(In your favor I'll just pretend you didn't mention kdb here...)
Please explain. What's bad about kdb, besides how nonstandard it is?
At the least they should come up with some seriously impressive benchmarks before dropping names like that.
This ties with my own experience using it.
For example, where can we find your configurations for the MySQL vs. MemSQL benchmark you show in your video? Or how big the dataset was, etc...?
The way I see it, this is very much targeted at financial firms that use kdb+ or products like it that perform really well for things like writing lots of tick data from dozens of exchanges at a time and querying across them quickly in a familiar SQL style.
But that's only a query language. Not a db architecture.
Let's see the source code. That's the only way to verify the approach that is being taken.
Given the source the first thing I'd do is search for "mmap".
They already get one demerit for using a row store.
And another for disk access.
It's not a column store. We are thinking about it, maybe some time in the future as we add more analytical features. It's a row store.
We do write back to disk. We have a transaction log, which we truncate once it grows over a certain threshold, by taking a snapshot.
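The log-plus-snapshot scheme described here can be sketched in a few lines. This is a minimal toy model, not MemSQL's implementation (the threshold, file names, and JSON encoding are all made up): every write is appended to a log; once the log passes a threshold, the full state is snapshotted and the log is truncated; recovery loads the snapshot and replays whatever remains in the log.

```python
import json
import os
import tempfile

class Store:
    THRESHOLD = 3  # truncate the log after this many entries (arbitrary)

    def __init__(self, dirpath):
        self.dir = dirpath
        self.state = {}
        self.pending = 0
        self._recover()

    def _recover(self):
        # Recovery = snapshot first, then replay the remaining log entries.
        snap = os.path.join(self.dir, "snapshot.json")
        if os.path.exists(snap):
            with open(snap) as f:
                self.state = json.load(f)
        logf = os.path.join(self.dir, "tx.log")
        if os.path.exists(logf):
            with open(logf) as f:
                for line in f:
                    k, v = json.loads(line)
                    self.state[k] = v

    def put(self, k, v):
        logf = os.path.join(self.dir, "tx.log")
        with open(logf, "a") as f:
            f.write(json.dumps([k, v]) + "\n")
        self.state[k] = v
        self.pending += 1
        if self.pending >= self.THRESHOLD:
            # Snapshot the full state, then truncate the log.
            with open(os.path.join(self.dir, "snapshot.json"), "w") as f:
                json.dump(self.state, f)
            open(logf, "w").close()
            self.pending = 0

d = tempfile.mkdtemp()
s = Store(d)
for i, k in enumerate("abcd"):
    s.put(k, i)
restarted = Store(d)  # simulated restart: recovers snapshot + trailing log
assert restarted.state == {"a": 0, "b": 1, "c": 2, "d": 3}
```

Truncating by snapshot keeps the log bounded, so recovery time is proportional to snapshot size plus the short tail of the log rather than the entire write history.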
We see a future where OLTP databases live in memory, and where you have hundreds of terabytes of memory and hundreds to thousands of CPUs at your disposal.
We built MemSQL to make it easy to go directly into memory and get the speed improvements we all need. What we're releasing aims to fulfill that vision.
Also, pricing. Come on, man! Pricing!!
p.s. Isn't a totally reasonable business model for a truly faster DB engine "free, everywhere, bought by Oracle?"
Data must fit in RAM, no joins over >2 tables, no transactions, no builtin support for clustering/sharding/horizontal scaling whatsoever.
What is the advantage over memory-tables, MySQL on a ramdisk or something purpose-built (redis)?
I'd say it's a database for a niche market, as one of the folks from MemSQL has been saying. It would fit in nicely under a few scenarios.
> ..something purpose-built (redis)?
Errr maybe "features" like GROUP BY with HAVING? I think you're trolling, Mr Moe.
Says the guy who resorts to selective quoting to make a point? You may want to review my original comment where I mention two other options.
Do you need to ensure that all of your data fits in memory? It's easily possible to have ten gigs in your db but only need a small amount of it to be hot in RAM. Does it take a really long time to restart if it has to warm all the data?
As for your comments:
1. Ad-Hoc SQL in VoltDB is a current area of focus in VoltDB development. We use similar compilation and expect them to be no slower than our stored procedures in the 2012 timeframe. Starting with our release next week, we'll also support multiple SQL statements in a serialized/ACID ad-hoc transaction.
2) On a single node (or several replicated nodes), VoltDB doesn't really have many issues with data skew. Rebalancing a distributed system is also in development, but hasn't turned out to be much of an issue with our production customers.
3. Faster at what? Single statement non-durable transactions? That's probably true for simple SQL operations for the time being. It gets murkier once you consider VoltDB features like materialized views that avoid table-scans altogether. I am also curious how durability affects performance though.
But I'm really interested to revisit this comparison once MemSQL supports real transactions and replication/sharding. For the time being, most of the VoltDB use cases we target won't run on it.
Ycombinator seems to have backed a few of these next-gen databases. It will be interesting to see who wins.
EDIT: Rethinkdb. But it looks like they've dumped their mysql engine and gone to a pure key-value store.
Xeround is a SaaS model, where you can't get the exact same experience as on your local machine.
Is there more to this than the article suggests?
Frankly, I can't see how their query compilation is much better than anything Oracle, SQL Server or PG does already. (see here: http://developers.memsql.com/docs/1b/codegen.html) Perhaps there is no plan cache in MySQL, or the plan cache is inadequate?
Incidentally, you don't want to use it if you need to do multi-statement transactions. Can't be done. Check out the documentation here:
It has a few extra tricks up its sleeve to enhance indexes in MySQL, but again nothing really particularly spectacular. http://developers.memsql.com/docs/1b/indexes.html
More interesting for me, is that it can be hosted on EC2, which is pretty neat!
There is no plan cache in MySQL. SQL Server has it, but it's not nearly as efficient. I know, since I used to work there.
Multi-statement transactions are coming. We are beefing up the SQL surface area, as well as working on many more exciting things.
Incidentally, sorry if I seem to be disparaging the product. I'm actually looking at it myself, but want to know what the limitations are. Multi-statement transactions would be a big win. Are you considering implementing higher isolation levels? Also, as was asked above, how are the startup times for multi-gigabyte/terabyte databases?
* Hint for the unwary - if you run a SELECT statement with a WHERE 1=0 AND <rest of predicates> (for brain-dead "dynamic" SQL generating apps), the initial compilation can take a dramatically long time in Oracle 10g (and I believe 11g, but cannot confirm). Oracle has no mechanism to shortcut query compilation for such queries. A bit obscure, but I never saw this problem with SQL Server. Oracle support were stumped by the issue too and gave me short shrift ("it might be fixed in an 11g patch... some time").
Recovery goes at the speed of disk.
The compile time for MemSQL is mostly code generation time. Once the code is generated/compiled, MemSQL does a very good job of matching queries against the existing plan by removing parameters from the query text.
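The "removing parameters from the query text" step above can be sketched like this. It's a hypothetical illustration (the regexes and cache are made up, not MemSQL's code): literals are stripped out of the SQL text to build a skeleton, so queries that differ only in their constants hit the same compiled plan and pay the codegen cost only once.

```python
import re

plan_cache = {}

def skeleton(sql):
    """Replace literals with placeholders so equivalent queries share a key."""
    sql = re.sub(r"'[^']*'", "?", sql)   # string literals -> ?
    sql = re.sub(r"\b\d+\b", "?", sql)   # numeric literals -> ?
    return sql

def compile_plan(key):
    # Stand-in for expensive code generation + compilation.
    return f"<plan for {key}>"

def get_plan(sql):
    key = skeleton(sql)
    if key not in plan_cache:
        plan_cache[key] = compile_plan(key)  # codegen happens once per shape
    return plan_cache[key]

get_plan("SELECT * FROM t WHERE id = 1")
get_plan("SELECT * FROM t WHERE id = 42")  # same skeleton: cache hit
assert len(plan_cache) == 1
```

The first query of a given shape pays the full generation cost; every subsequent query with the same skeleton reuses the compiled plan.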
CREATE PROCEDURE foobar
On very large queries (many table joins) there can be a noticeable performance hit while the DBMS figures out the optimal query plans. Stored procedure compile and store this plan after first execution (some pre-compile) so the DBMS can just look it up in the cache.
It also removes the need to do runtime parsing (compiled vs. interpreted languages). There are more nuances to Stored Procedures than I can put here, but here's a link to a more in-depth analysis I wrote:
In general, MySQL has a limited query plan cache (last I read, it was one cache per database connection and was wiped out when the client disconnected), so this is definitely a win. However, since every other DBMS out there already has a robust plan cache, I don't think the Postgres/Oracle/MS SQL etc. folks are losing any sleep.
Giants are investing in big-memory technology, no doubt about it, which is great. That's the beauty of the huge big data market - one size doesn't fit all.
I believe some DBMS vendors are working on the interpreted nature of the execution, but basically the query plan determination makes up the bulk of the parsing/execution time. Compiling SQL to something... "native" doesn't provide a whole lot of win in terms of execution latency.
Since we work in memory, you see a huge perf boost with code generation because there are no lags with disk.
Even if it is (and I can't see anything on their developer website) then if it's ACID compliant, like they say, then it will have to be writing to the disk at some point - at the very least it will need a transaction log. Otherwise, it's not really usable for anything you need to persist to long term storage!
Yes, it does write to disk (append only logging I think) but my point was, that if you are keeping ALL your data in memory, you can optimise the storage for fast queries, as opposed to a hybrid / paged memory and disk system.
"By default, MemSQL runs with full durability enabled: transactions are committed to disk as a log and later compressed into full-database snapshots. The snapshot and log files are stored in the datadir directory configured in memsql.cnf. When MemSQL is restarted, it will recover itself into the state it was in before the shutdown, by reading the snapshot and log files in datadir" 
I missed that piece of documentation on their website. I think they could do a better job of explaining the advantages and how their technology works. Their product overview doesn't really tell you much of anything, except that they have an enhanced query parser.
We will also put the technology on steroids, by building a clustered system
Isn't that the same as MySQL Cluster, minus the SQL-to-C++ part (and maybe lock-free structures)? Why not fork MySQL Cluster to add the SQL-to-C++ code generation? Then it seems you could claim (from MemSQL site) "MemSQL is ... still faster than other in-memory databases because of SQL-to-C++ code generation" and also support every feature of MySQL Cluster. Why reinvent the wheel only to create something with great limitations compared to the existing wheel?
It is a little more complicated than that.
MySQL Cluster is lock-free, distributed, HA, and has better performance:
And MySQL Cluster has been used in mission critical apps for years.
What does MemSQL give me that MySQL Cluster doesn't?
How does the database handle this sort of thing?
Edit: I know I'm asking a lot about the plan cache, but as it's a core selling point, I'd like to know how data distribution changes will affect performance. Stale query plans could adversely affect performance under certain circumstances.
Does anyone know if there are any plans for having MongoDB style replica sets (e.g. sharded and replicated databases in a cluster)?
Also, it would be great if it supported the native mysql replication, so you could have MemSQL replicas of a master mysql DB.
Side note: IMO, If you want to completely dominate this space, make it super easy to add new nodes and replicate the existing DB without having to take a snapshot from the master.
I should be able to try it for one of my apps over this weekend!