Hacker News new | past | comments | ask | show | jobs | submit login
How To Make An Infinitely Scalable RDBMS (highscalability.com)
154 points by jpmc on Nov 25, 2013 | hide | past | favorite | 89 comments

There was very interesting presentation by one professor. I'm not sure about what university, but he seemed to know his work.

He talked about how databse world is about to change. ACID is really expensive in terms of resources, and so are the more difficult things about relational schema (foreign keys, checks, etc). And architecture of classic RDBMSes is pretty wasteful -- they use on-disk format but cache it in memory.

He talked about how there are basically three new paths for DBMSes to follow. 1) Some drop the restrictions to become faster. This is the NoSql stuff, because you don't really need ACID for writing to Facebook wall.

This is called NoSql database.

2) OLAP, in data warehousing, the usual way to do things is that you load ridiculous amount of data into database, and then run analytical queries, that tend to heavily use aggregation and sometimes use just few dimmensions, while the DWH data tend to be pretty wide.

For this, column store makes perfect sense. It is not very quick on writes, but it can do very fast aggregation and selection of just few columns.

This is called Column store.

3) In OLTP, you need throughtput, but the question is, how big are your data, and how fast do they grow? Because RAM tends to get bigger exponentially, while how many customers you have will probably grow linearly or maybe fuster, but not much. So your data could fit into memory, now, or in future.

This allows you to make very fast database. All you need to do is to switch the architecture to memory-based, store data in memory format in memory and on disk. You don't read the disk, you just use it to store the data on shutdown.

This is called Main memory database.

No, that was the presentation. It was awesome, and if someone can find it, please give us a link! My search-fu was not strong enouhg.


What interests me is that we have NoSql databases for some time already, and we have at least one huge (are very expensive) column store: Teradata. But this seems to be first actual Main memory database.

My dream would be to switch Postgres to main memory or column store mode, but I guess that's not happening very soon :)

> But this seems to be first actual Main memory database.

Eh, not really...

This is exactly what SAP has been doing for several years via Hasso Plattner and the Potsdam Institute: https://epic.hpi.uni-potsdam.de/Home/HassoPlattner

If you've ever worked with large scale "enterprise" database warehouses, they tend to be slow and clunky. Back in 2006ish SAP took the whole Data Warehouse (well mainly just the data cubes) and chucked it into a columnar database (at the time it was called TREX, then became BW Accelerator) - http://en.wikipedia.org/wiki/TREX_search_engine

TREX exist way before 2006. SAP also bought a Korean company called P* (IIRC) which did non-columanr (traditional relational) and threw it into memory. SAP also had a produce called APO LiveCache - http://scn.sap.com/community/scm/apo/livecache - which lived around the same time.

This has now all evolved to a standard offering called SAP HANA - http://www.saphana.com/welcome - In it's second year of inception I believe SAP did roughly $360m in sales just on HANA alone.

Also, IIRC is InnoDB basically the open source version of exactly what you're talking about with "Postgres to main memory"?

edit- correction in TimesTen

InnoDB isn't anything like that - it's a transactional database engine that's been around since the 90's and has since become the standard storage engine for MySQL - it competes directly with Postgres' storage layer.

Is this the talk that you are referring to? http://slideshot.epfl.ch/play/suri_stonebraker

Note that Stonebraker makes some good points, but there are many ways to build scalability and Stonebraker is too fast to dismiss many.

In particular, his criticism of traditional databases seems based more on philosophy rather than evidence.

I'd advise reading both sides of the story:






The date on some of those posts in interesting. 2009 is quite a while ago now, and I'd suggest that columnar datastores haven't exactly taken over. Some implementations have made some progress (eg Cassandra), but OTOH many non-traditional datastores have added traditional-database like features (eg, Facebook's SQL front end on their NoSQL system), and traditional databases have added NoSQL features too.

Stonebraker is a very smart person but he's also not shy about promoting his own companies/research. You generally get a well-informed but very opinionated take on things from him.

VoltDB, for example, is good for certain complex workloads over large but not-too-large data sets. For a lot of situations it isn't really an alternative to memcache+MySQL or a NoSQL solution.

If I may drool a little, you guys represent the heart of Hacker News. Insightful summary, mentioning that somewhere somebody gave such a talk. As I was reading the first comment I was silently cheering for "a librarian's follow-up", and there it was!

Yes, that's the one. Thank you! I'm bookmarking it right now.

The log mechanism Prof. Stonebraker prefers, command logging vs ARIES, almost all newer data stores use command logging w/checkpoints (i.e., redis, mongo) and ship changes to other nodes similarly.

After running a large production redis environment, having a large redo log makes startup/recovery painful. I'm not convinced command logging is the most efficient in all scenarios especially when doing counters where the command is more verbose than the resulting change.

That sounds very much like Michael Stonebraker's typical pitch these days.

You mentioned OLTP. Erlang's Mnesia store comes to mind, but as far as I'm aware it's limited to a 4GB data set. I'm not sure if that qualifies as a main-memory db, but it might be similar.

To add to your list, Vertica (HP) and Paraccel are columnar; TimesTen was a main-memory database bought a number of years ago by Oracle.

> My dream would be to switch Postgres to main memory or column store mode, but I guess that's not happening very soon :)

If it can be done besides the traditional architecture, be it in a fork or without touching existing code; and if you can at least start the work, it could happen soon.

A few years ago I was tinkering with Haskell and looked at a framework call HAppS which kept all its state in memory. Doesn't look like the project has really been active lately.

HAppS is dead, long live Happstack!

What you're talking about was the HAppS-State component of the HAppS application server, a project which is in deed not active anymore. Happstack is the active fork of HAppS and had a "happstack-state" component for a while, but this was eventually rewritten from scratch and made independent of Happstack and is now known as acid-state [1]. It's even used for the new Hackage server that powers the Haskell package ecosystem.

[1] https://github.com/acid-state/acid-state

And just to add to the list of c-store databases, there's also Sybase IQ. (I believe Sybase is now owned by SAP so it may have been rebranded)

Teradata is not a column store afaik. Vertica would be a good example of such.

I'm a little skeptical:

- a bunch of the novel components (the UPS aware persistence layer, for example) aren't actually built yet

- they're pushing for people to build businesses on it already. I would characterize it as "bleeding-edge with bits of glass glued on", so this doesn't seem entirely honest.

- there's mostly a lot of breathless talk about how great and fast and scalable it is, but no mention of CAP theorem. To boil down their feature set, it's an in-memory RDBMS using the Actor model.

Hi, Alan. Yes, many things are not built yet. Nowhere am I pushing anybody to build their business on it yet, but I am looking for hackers and early adopters/alpha testers. I've gone through pains on every doc that I've created that this is early, alpha, needs lots of work--including the 2nd paragraph of the linked-to article: "InfiniSQL is still in early stages of development--it already has many capabilities, but many more are necessary for it to be useful in a production environment."

Regarding CAP, I'm not addressing multi-site availability at this stage--I want to get single site fully operational, redundant, and so on.

And, yes, to boil it down, it's an in-memory RDBMS using the Actor model.

The most important feature is that it performs transactions involving records on multiple nodes better than anything. This is the workload that keystores functionally cannot do, and which other distributed RDBMS' suffer under. It's also open source, has over 100 pages of technical docs, and is functional enough for people to pound on with some workloads--but not something to put into production yet.

CAP theorem can apply to any clustered system, it doesn't have to be multi-site. What happens if 6 of your 12 machines die? What if they get cut off from the other 6?

edit: There's a bit of discussion further down about the SQL implementation. That's something I was very curious about as well. The projects linked below spend a lot of time working on supporting full ANSI SQL, and reducing latency by pushing down as many operations as possible. The Overview page doesn't appear to mention how filtering, aggregation, windowing, etc. work in your system.

Also, I noticed on your website that you compare InfiniSQL to Hadoop. How do you feel it compares to Impala (http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-t...) and Shark (https://amplab.cs.berkeley.edu/projects/shark-making-apache-...)?

Hi, Alan. Regarding CAP, I think that given redundant cluster interconnects, redundant managed power, odd # of cluster managers for quorum, all mean that split brain is just about out of the question, configured properly.

The main reason that I have a FAQ about Hadoop is that I have been asked repeatedly by people, "what's the difference between InfiniSQL & Hadoop?" It seems to be the data project most on a lot of people's minds. It's a fair question, so I created an FAQ entry on my site.

I don't know what Impala or Shark's performance are for transaction processing. Show me the numbers, in other words. I don't believe that Hadoop is going to eat the world, but that it's best use cases are probably in the reporting realm. It seems that Impala wants to bridge that gap between reporting and operational/transactional database--but I can't read from their architectural description just how well it would actually perform for transaction processing workloads.

Regarding Shark, it looks to me like they still see it as an parallelized reporting system, and not geared towards OLTP/operational workloads.

I expect that InfiniSQL will be able to handle quite sophisticated analytics workloads, as more capabilities are added, but they will be more for real-time. I don't see it displacing special purpose analytics environments, especially the masive unstructured ones.

Regarding filtering, aggregation, windowing, I haven't documented it yet. The SQL engine is pretty simple at this point--it parses, makes an abstract syntax tree, then executes. If you need more, then the code is there. :-)

Redundant components and pathing, properly implemented and managed, are what allow enterprise storage arrays, mainframe clusters, Tandems, and the like, to operate 24x7 for years on end. Their myth seems to work.

Yes, but not without addressing what should happen when split brain occurs.

Well, split brain means two parts think they're both active. At that point, nothing can be done (short of manual intervention), because both parts are faulty, and likely, a bunch of others are failed.

But, to avoid split brain in a failover circumstance means that a replica won't come up unless it gets a majority of votes. There's no way for any other replica to also get a majority of votes, therefore only one replica will come active. No split brain. This, of course, assumes proper operation of the cluster management protocol.

(this is CP operation)

Indeed, there's a lot of misunderstanding around this aspect.

The strength of the eventually consistent systems doesn't lie in the fact that it guarantees consistency during network partitioning, but that it maintains availability in face of partitioning. Even parts of the cluster that have been cut-off from the quorum can operate, in various degrees of degradation, ranging from being able to respond to stale queries, or also accept writes whose consistency is later resolved (for example with vector clocks or Commutative Replicated Data Types, see http://highscalability.com/blog/2010/12/23/paper-crdts-consi... or http://basho.com/tag/crdt/)

If I understood it correctly, InfiniSQL isn't trying to solve the problem of providing backend capacity for parts of your cluster that are currently partitioned, assuming that you can minimise the likelihood of this event to happen. If a network partition happens in the cluster, it's also very likely that all services in that partition will not be able to serve transactions, hence there is no much to gain from a system that is able to accept writes or perform stale reads without quorum.

On the other hand there are other workloads, like batch processing, that might benefit of being able to continue operating during a network partition without loosing big parts of processing capacity.

> configured properly.

... and there is where it falls apart. Sooner or later, "someone" is going to do something Incredibly Dumb that is going to take down a lot of nodes.

If you are betting that you can just add enough redundancy split brain can't happen, I have to question why I should take you seriously with important data.

(It also indicates to me this is going to be ludicrously expensive to set up though, but that's another issue)

Redundancy, quorum protocol, proactive testing, rigorous QA.

And actual 24x7 environments with important data are ludicrously expensive. I expect InfiniSQL to be less expensive since it's based on x86_64, Linux, is open source. But yeah, hardware and environment need to be right.

I'm not sure what I said wrong, but I'm nowhere claiming that split brain (or any failure scenario) can be ruled out entirely in all circumstances--but in practice, split brain is avoided 24x7 for years on end in many different architectures.

It's not magic.

Just like you can't have "enough" storage redundancy. You can have a 100-way mirror of hard drives that will still lose data if you lose 100 disks in less time than somebody replaces one of them.

Congratulations, you have convinced me never to use your database. You can't ignore CAP, because machine failures, momentary high latency and bona fide network partitions happen.

I work at Cloudera (although not on Impala). I work on HDFS.

Impala was about creating a low-latency SQL engine for Hadoop, so that queries could be done interactively by a human at a keyboard. This is something that you don't really get with Hive (despite all the recent hype) because it simply is too slow, and has high startup costs. That's unlikely to change in the future because of the overhead of spinning up JVMs, starting MapReduce jobs, etc.

It seems like you are trying to target the OLTP market. That's a difficult market to crack. A lot of the value of things like Oracle and Microsoft SQL server is not in the database itself, but in the surrounding software. Performance is nice, but unless you can get orders of magnitude, it's very difficult to compete.

Will anybody ever bridge OLTP and OLAP? The last people who claimed to be trying to do that were Drawn to Scale, and we all know how that turned out. I think it's better to focus on doing one thing well.

Good luck.

> Regarding CAP, I think that given redundant cluster interconnects, redundant managed power, odd # of cluster managers for quorum, all mean that split brain is just about out of the question, configured properly.

Failure is never out of the question, you either plan for it or you suffer from it. CAP applies.

>Nowhere am I pushing anybody to build their business on it yet...

"In fact, InfiniSQL ought to be the database that companies start with--to avoid migration costs down the road."

This is far from looking for hackers and early adopters. I understand that you're enthusiastic about something you've created, but let's be reasonable for a moment. I'm more than happy to try this out, but starting my business on something that isn't proven yet and has a fair number of durability features yet to be implemented is a no, no.

Edit: My posts on this thread may be coming off as negative and that's not what I intended. I'm cautious of new technology that purports to deliver the world to me on a silver platter. That said, I'd be happy to throw everything I've got at it to see what shakes loose so you can get this to production quality sooner.

When I see a new distributed database, the first thing I want to know is "Is it AP or CP?"

If you haven't answered that question yourself, it's likely you have a partition-intolerant system, and that in real-life scenarios people will lose data.

CP in a single site. Between sites, I haven't considered it too much.

Do you think you could write more about this?

> UPS systems will stay active for a few minutes, based on their capacity, and the manager process will gracefuly shut down each daemon and write data to disk storage. This will ensure durability--even against power failure or system crash--while still maintaining in memory performance.

How does a UPS ensure durability against system or program crashes, disk corruption in large clusters, and other failures that can affect a simple write()?

> The real killer for database performance is synchronous transaction log writes. Even with the fastest underlying storage, this activity is the limiting factor for database write performance. InfiniSQL avoids this limiting factor while still retaining durability

How do you plan to implement this (since it appears it hasn't been implemented)? What is your fundamental insight about synchronous transaction logs that makes InifiSQL capable of being durable while (presumably) not having a synchronously written transaction log? If your answer is the UPS, please see my first question.

Edit: I don't see any mention of Paxos anywhere. Could you explain what you're using for consensus?

Hi, yid. UPS protects against multiple simultaneous system crashes. Single system crash gets failed over, no problem. If both UPS systems detect their upstream PDU's as being out, then the InfiniSQL management protocol will initiate graceful shutdown, including persisting to disk. For write() issues, at least intially, I think that stuff in commodity hardware (such as ECC memory) is sufficient protection in most cases. Attaching a high end storage array, or using ZFS, also protects against low level disk problems. I don't see those problems as needing to be solved for a 1.0 relase, but am very much open to contributions that address those issues any time you want!

The fundamental insight about not needing transaction logs is pretty simple actually: if the power is guaranteed to either stay on, or to allow the system to quiesce gracefully, then the cluster will not suddenly crash. That's the motivator for transaction logs--to make sure that the data will still be there if the system suddenly crashes. Get rid of the need for transaction logs, get rid of the transaction logs.

Regarding consensus, I expect that there will be a quorum protocol in use amongst an odd number greater than 2 of manager processes, each with redundant network and power. But the specific protocol I haven't ironed out. If there's something I can grab off the shelf then it may be preferable to implementing from scratch, but I haven't gotten there yet.

This stuff hasn't been implemented yet, but the core around which it can be implemented, has been.

Do I sense a volunteer? ;-)

I feel like you're banking a little too heavily on external measures for error protection/durability like UPSes and ECC memory, rather than embracing the inevitable fact that corruption and failures will occur. In short, I'd really need to see a white paper on why you're not reinventing the wheel before I'd use InifiSQL.

Kudos for engaging the community though; please do keep us posted as you progress.

Actually, there's precious little that an application can do if a memory chip fails, or if ECC gets too many corrupted bits. If it gets too many corrupted bits, the kernel will generally do something like halt the system. I am not familiar with any application which does a write-read-write (or similar) to memory, but would be curious to learn about them. I'm sure such an algorithm can also be used in InfiniSQL.

I am familiar with applications such as IBM WebSphere MQ which does a triple-write to disk for every transaction, to overcome corruption problems. But even IBM will recommend turning that parameter off for performance reasons if the storage layer performs that kind of verification, such as HDS or EMC arrays. So, if an InfiniSQL user wants that level of storage protection, they can buy it.

Regarding InfiniSQL's planned use of UPS systems, that's pretty much the identical design to how the above-mentioned storage arrays protect block storage from power loss. I'm just moving the protection up to the application level.

Thanks for the conversation and thoughtful comments. Please go to http://www.infinisql.org/community/ and find the project on Twitter, sign up for newsletter, hit me on LinkedIn.


Hi, yid. For some reason I can't reply to your last comment. I think I didn't explain things well enough--I do plan to implement synchronous replication to protect against single replica failure. I describe the plan somewhat in http://www.infinisql.org/docs/overview/

Much of the code for the replication is in place (and it actually worked a few thousand lines of code ago) but it's currently not functional.

I have a big backlog to work on, but n-way synchronous replication is definitely in there.

> Actually, there's precious little that an application can do if a memory chip fails, or if ECC gets too many corrupted bits.

That's where replication and (distributed) consensus comes in, usually at the application level.

Well one example is that bug that caused a panic on a certain date (I think it may have been leap year related) - machines just die then. Only way to recover from that would be some neat reboot system that recovers the data from RAM. But, all systems have dataloss, so this problem affects any main-memory system.

You may get more volunteers by publishing a paper outlining the core concept. I clearly remember reading things like the H-Store paper, or the Dremel paper, and saying "damn, this makes sense and is really cool". Implementation details can be worked out and engineering approaches tried. But the underlying concept should be clear.

I'll consider this--meantime, will you please follow me on Twitter and/or sign up for the newsletter, so that you can be informed when the document is ready?

UPSs don't always work as expected.

For example, I have real-life experience from this event:


And that was a top-tier datacenter at the time. Good luck doing better on your own, and just punt if you are using the cloud.

I totally agree--the architecture I'm calling for is to have redundant UPS's, each managed by InfiniSQL processes--for ultimate availability. If people just want high performance but want to live with a datacenter / cloud provider's ability to maintain power, then I want to support them in that, too.

But I've suffered power outages in data centers, and they'll eventually come around to bite everybody.


Clause 13 is a real pain to deal with when exposing this over the network.

I guess the developer wants to sell a license (like the mysql java client GPL'ing).

Can't blame him, he needs to get paid.

Hi, gopalv. What samspenc said. My understanding of the AGPL is that only the modifications made to the source code itself of the covered project would need to be opened up (or have a commercial license). Meaning that merely using InfiniSQL won't require you to open source your app. MongoDB has the same license BTW, and lots of people use it without being forced to open their code. And, yes, I want to get paid somehow--but the AGPL won't stop anybody from using my work how they see fit. But if they modify it and distribute it, then they'll have to comply with the license (or contact me directly for an alternate arrangement).

I fully understand what this means and I hope you do get calls about alternate licensing, but remember that people like me do not make these decisions.

I thankfully don't have to - this means I don't need to talk to lawyers about this.

Because AGPL took away the most important bit of unassailable ground I had to argue with when it came to deploying GPL - "Using this code implies no criteria we have to comply to, only if we distribute it".

Clause 12 and 13 - basically took that away from me completely.

Look, I'm not going to tell you what license to use.

But leave me enough room to complain that I have had trouble convincing people that we can use AGPL code in a critical function without obtaining a previous commercial license by paying the developer.

Hi, gopalv. I'm glad to talk to you further privately if you wish. You can go to InfiniSQL's site to send me your email, connect on LinkedIn, or whatever: http://www.infinisql.org/community/

I'm not too religious about licensing--if I can get early adopters/contributors, and so on, I'm willing to consider changing the license terms.

I'm looking for open doors.

Personally, I think the choice of AGPL is good. It enables you to give to the community and get useful community involvement, while allowing commercial companies to have proprietary forks (for a cost) as well as commercial support.

All the best!


MongoDB does this too. I personally think its not too bad - its free for whoever wants to use it, but if you want to modify and use it commercially, you do have to pay.

In-memory distributed database? VoltDB is already way past 500Ktx/sec on a 12-node cluster.

On their site though, it says no sharding and that it can do these 500Ktx/sec even when each transaction involves data on multiple nodes. Does this performance degrade directly in relation to the number of nodes a tx needs to touch?

A simple, straightforward, wire-level description of how things work when coordinating and performing transactions across would be very useful. There's a lot of excited talk about actors, but nothing that really examines why this is faster, or any sort of technical analysis.

Hi, Michael. Yes, VoltDB is very fast, but they self-admittedly do not perform well if transactions contain records spanning across multiple nodes. That is the key feature difference between InfiniSQL and VoltDB (along, of course, that their project is functionally much further along).

If you want more details about how things work when performing transactions, I think that the overview I created would be a good starting point. It probably doesn't have everything you'd ask for, but I hope that it answers some of your questions: http://www.infinisql.org/docs/overview/

And to answer your question about performance degradation pertaining to number of nodes each transaction touches, I have not done comprehensive benchmarks of InfiniSQL measuring that particular item. However, I do believe that as multi-node communication increases, throughput will tend to decrease--I expect that the degradation would be graceful, but further testing is required. The benchmark I've performed and referenced has 3 updates and a select in each transaction, all very likely to be on 3 different nodes.

I'd like to invite you to benchmark InfiniSQL in your own environment. I've included the scripts in the source distribution, as well as a guide on how I benchmarked, as well as details on the specific benchmarking I've done so far. All at http://www.infinisql.org

I'd be glad to assist in any way, give pointers, and so on, if there are tests that you'd like to do. I also plan to do further benchmarking over time, and I'll update the site's blog (and twitter, etc) as I do so.

Please communicate further with me if you're curious.

Thanks, Mark

So basically, VoltDB acknowledges that cross-partition transactions are Hard and has put a lot of effort into minimizing them. (This is basically the entire point of the original HStore paper.)

InfiniSQL says don't worry, we'll just use 2PC. But not just yet, we're still working on the lock manager.

I look forward to your exegesis of how you plan to overcome the well-documented scaling problems with 2PC. Preferably after you have working code. :)

I'm not doing 2PC. I'm curious what gave you that impression, and I'll clarify any documents that seem to give off that impression.

I'd like InfiniSQL to handle hard workloads as well as it can, to minimize the amount that developers need to redesign their applications.

Please go to http://www.infinisql.org/community/ and follow the project on Twitter, and sign up for the newsletter, to keep apprised of progress.

Thanks for the input!

Oh, I thought you said 2PC (two-phased commit).

Yes, InfiniSQL uses two phase locking. The Achilles' Heel is deadlock management. By necessity, the deadlock management will need to be single-threaded (at least as far as I can figure). No arguments there, so deadlock-prone usage patterns may definitely be problematic.

I don't think there's a perfect concurrency management protocol. MVCC is limited by transactionid generation. Predictive locking can't get rolled back once it starts, or limits practical throughput to single partitions.

2PL works. It's not ground-breaking (though incorporating it in the context of inter-thread messaging might be unique). And it will scale fine other than for a type of workload that tends to bring about a lot of deadlocks.

HAT (http://www.bailis.org/papers/hat-vldb2014.pdf) looks pretty promising in terms of multi-term transactions. It turns out that you can push the problem off to garbage collection in order to make transaction id generation easy, and garbage collection is easier to be sloppy and heuristic about. The only problem is it isn't clear yet that HATs are as rich as 2PL-based approaches, and that nobody's built an industrial strength implementation yet.

TransactionID generation, as you mentioned it, is probably being limited by the incredible expense of cross-core/socket/etc. sync.

Go single-threaded and divide up a single hardware node (server) into one node per core, and your performance should go way up. You'd want to do something like this anyways, just to avoid NUMA penalties. But treating every core as a separate node is just easy and clean, conceptually. I/O might go into a shared pool - you'd need to experiment.

I've seen this improvement on general purpose software. Running n instances where n=number of cores greatly outperformed running 1 instance across all cores.

Only major design change from one node/proc is that your replicas need to be aware of node placement, so they're on separate hardware. You may even consider taking this to another level, so that I can make sure replicas are on separate racks. Some sort of "availability group" concept might be an easy way to wrap it up.

Also: your docs page clearly says 2PC was chosen (it's in the footnote). Maybe I'm misreading what "basis" means.

Hey Mark. The overview seem to be much of the same. There's a lot of excited talk, which is fine, but should be limited to a leading paragraph. The fundamental issue is performance in face of transactions that need to do 2PC among multiple nodes (which also need to sync with the replicas).

I'm not much of an expert at all, but I like reading papers on databases. It seems to me that if you really did discover a breakthrough like this, you should be able to distill it to some basic algorithms and math. And a breakthrough of this scale would be quite notable.

If I'm reading correctly, there's no replica code even involved ATM. So 500Ktx/s really boils down to ~83Ktx/sec per node, on an in-memory database. Is it possible on modern hardware that this is just what to expect?

I am curious, and I'm not trying to be dismissive, but the copy sounds overly promising, without explaining how, even in theory, this will actually work. I'd suggest to explain that part first, then let the engineering come second.

Your advice that I create formal academic-style paper is reasonable, and I agree that it should be something that I pursue. Will you follow me somehow (by links at http://www.infinisql.org) so that when such is produced, you'll see it? I can't guarantee getting front page here again, and don't want to be missed in the future, especially as you (and others) have been asking for this type of information.

And, yes, many thousands of transactions per node in memory is what should be expected. But scalability of ACID (lacking durable, as discussed) transactions on multiple nodes--that's the unique part. I'll try to distill that into a paper.

I signed up for the newsletter.

It doesn't have to be formal and academic enough to be published. Just something that explains how performance is going to be achieved - any sort of analysis.

Looking at the "About and Goals" section of their docs http://www.infinisql.org/docs/overview/#idp37033600

I can't seem to find the word "Reliable" or any variation thereof anywhere in there.

In fact, that word is no where to be found on the blog post or on the entire InfiniSQL page (not in the Overview, Guides, Reference or even FAQ). I find this quite remarkable since reliability is the true virtue of an RDBMS, not speed or even capacity. At least that's what PostgreSQL aims for and this being another RDBMS, and is also open source, I see it as InfiniSQL's only direct competitor.

It's nice that this is scalable, apparently, to ridiculous levels, but if I can't retrieve what I store in exactly the same shape as I stored it, then that's a bit of a buzz kill for me.

Can we have some assurance that this is the case?

There's a note on "Durability" and a shot at log file writing for transactions, and presumably InfiniSQL uses concurrency and replicas, to provide it. In the Data Storage section, it mentions that InfiniSQL is still an in-memory database for the most part http://www.infinisql.org/docs/overview/#idp37053600

What they're describing is a massively redundant, UPS backed, in-memory cache.

Am I wrong?

Hi, eksmith. I talk a bit about plans for durability in that overview document.

I promise that I have every intention of making InfiniSQL a platform that does not lose data. I have a long career working in environments that demand 100% data integrity. If I de-emphasized it, it was not intentional.

PostgreSQL doesn't scale for OLTP workloads past a single node. There are a handful of products similar to InfiniSQL (google for the term NewSQL for a survey of them).

And yes, a redundant UPS-backed in-memory cache. I have some ideas on how to do regular disk backing as well (which I'm sure you've read).

And if a more traditional log-based storage layer is added, InfiniSQL will still scale nearly linearly across nodes horizontally. Multi-node scale and in-memory are not dependent on one another. Though I believe that redundant UPS systems managed by a quorum of administrative agents, and provide durability just like writing to disk.

Are you familiar with high end storage arrays, such as from HDS or EMC? They write to redundant memory, battery backed and managed by logic in the arrays. I'm just moving that type of design to protect the database application itself, up from the block layer.

And some people trust their datacenter power--they use pure in-memory databases without UPS already, or they do things like asynchronously write transaction log, which also sacrifices durability. For those groups, InfiniSQL ought to be just fine, without UPS systems.

The write bottleneck for traditional databases has never been the write-ahead log, with group commit and a battery-backed RAID controller you'll have a hard time saturating the disk with log writes. The bottleneck has always been random I/O induced by in-place updating indexes based on B-trees. You don't need to be in-memory if you use better data structures. TokuDB and TokuMX are proof of that.

Hi, Leif. It's not hard to get to the throughput limits of a single log device, even on a fast array. I've done it on Sybase, WebSphere MQ, Oracle, MySQL, basically on enough platforms that I assume it to be the general case. The log writes don't saturate the array itself--but the log file has a limit to how many blocks can be appended--even on fast arrays. But imagine getting rid of the transaction log entirely--the entire code path. That will be faster even than a transaction log write to memory-backed filesystem.

But I agree that other write (and read) activity going on in the background and foreground, also limits performance--and in fact, I've seen the index write bottleneck that you describe in real life, more-so than simple transaction log writes. So, you're correct.

I've read about Toku, but I really doubt that it writes faster to disk than writing to memory. Are you really trying to say that?

I think it would be great for InfiniSQL to be adapted to disk-backed storage, in addition to memory. The horizontal scalability will also apply, making for a very large group of fast disk-backed nodes.

I think your input is good.

I'm not talking about the speed of the array, rather a battery backed controller. With one of those, as long as you're under the sequential write bandwidth of the underlying array, it pretty much doesn't matter how much you fsync the log, so that bottleneck (commit frequency) goes away.

If you're planning to write data faster than disk bandwidth, then you have no hope of being durable and we're talking about problems too different to be worth comparing, and in that case I retract my comment.

I don't understand what distinction you're trying to make between the "array itself" and the "log file has a limit to how many blocks can be appended". Can you clarify what limit you're talking about?

Well, array=battery backed controller with a bunch of disks hanging off of it. Actually, there is latency associated with every sync. What I've seen on HDS arrays with Linux boxes and 4GB fibre channell adapters is about 200us per 4KB block write. That is very very good for disk. It's also slower than memory access by many orders of magnitude. This was about 3 years ago. Things are bound to be faster by now, but still not as fast as RAM.

I don't think it's unreasonable to want to write faster than an I/O subsystem can handle. Maybe it's not for every workload, but that doesn't mean it's for no workload.

The distinction I wasn't being clear about was that the storage array (the thing with the RAID controller, cache and disks hanging off of it) is not being saturated if a single transaction log file is being continuously appended to. But that headroom in the array does not translate to more throughput for the transaction log. I don't know if it's an important distinction.

>The log writes don't saturate the array itself--but the log file has a limit to how many blocks can be appended--even on fast arrays

yes, the issue usually isn't the transaction log append speed. Instead, it happens too frequently that the log is configured to be too small. A log file switch causes a flush of accumulated modified datablocks of tables and indexes [buffer cache flush in Oracle parlance] from RAM to disk. With small log file size, the flush happens too frequently for too small amounts of modified data - this is where GP mentioned random IO bites in the neck.

I think you're talking about an insert buffer, not a transaction log, and in that case, no matter how big your insert buffer is, it will eventually saturate and you'll end up hitting the performance cliff of the B-tree. You really need better data structures (like fractal trees or LSM trees) to get past it.

no, i'm talking about transaction log ("redo log" in Oracle parlance). Switching log files causes checkpoint (ie. flush - it is when the index datablocks changed by the inserts you mention will finally hit the disk )


MS and DB2 have similar behavior.

Oh ok, now I see what you're saying, it's still similar to an insert buffer in that case. B-tree behavior is still to blame for this, and if you make your log file bigger it lets you soak up more writes before you need to checkpoint but you'll either have even longer checkpoints eventually, or you'll run out of memory before you have to checkpoint.

We also checkpoint before trimming the log, but our checkpoints are a lot smaller because of write optimization.

>even longer checkpoints

yes, that is the point as big flush instead of many small ones would take either the same or, usually, less time than cumulative time of small flushes because of IO ordering and probability of some writes hitting the same data block.

An in-memory RDBMS hardly seems to be "infinitely scalable". How would this work with DBs in the terabyte size or larger?

A terabyte of RAM is pretty cheap. Around $12K for the RAM. Last I quoted out a system for VoltDB, the total cost (complete servers with CPU, disk, RAM) came to ~$17/GB to $22/GB.

If you actually have transaction processing at this scale and need that performance, the RAM cost is not a major issue.

Well, 2-way Cisco servers can hold 1TB RAM each.

It scales as long as throughput increases while new nodes are added. I've done benchmarking up to 12 nodes, and it continued to scale nearly linearly. (http://www.infinisql.org/blog). I'd like to push it further, but need $$$ for bigger benchmark environments.

Badly. But scaling in dataset size, and scaling in performance are not the same thing. Busy eshop might need no more than 5 GB of space (growing 100 MB per month or something) but require very high speed.

This is what Clustrix (YC company) claims to do.

Hi, amalag. Yes, Clustrix is very similar to InfiniSQL (not to mention having been around longer). I believe that InfiniSQL has vastly higher performance at least for the type of workloads that InfiniSQl is currently capable of. InfiniSQL is also open source.

I hope there's room for competition in this space still.

What do you base your performance claim vs Clustrix on?

Here is some back of napkin analysis:

Starting with this benchmark report: http://www.percona.com/files/white-papers/clustrix-tpcc-mysq...

Basically, InfiniSQL does not currently support complex indices, so it can't do a TPC-like transaction.

The maximum throughput on 9 nodes is 128,114 per node per second. I don't know if that's 4 or 8 core nodes. If roughly 10% of transactions are multi-node transactions, then 12,811/node/s for multi-node, and 115,303/node/s for single-node transactions.

I don't know if full redundancy for Clustrix was configured, or a hot spare, so I don't know how many actual usable nodes were configured, but likely fewer than 9. But I don't know the precise number.

Roughly 10% of those transactions are contain records on multiple nodes. Based on 9 nodes, that means about 12811/node/s for distributed transactions combined with 115303/node/s for single node transactions.

InfiniSQL maxsed at over 530,000 multi-node transactions on 12 x 4-core nodes. http://www.infinisql.org/blog/2013/1112/benchmarking-infinis...

That's 44,167 per node.


These were not apples-apples benchmarks, but Clustrix performed about 12,000 multi-node transactions per node per second, along with a whole bunch more single-node transactions.

I don't know how it would perform on the benchmark I used. And I intend to do a tpcc benchmark once InfiniSQL is capable of complex keys (among whatever else it it currently is missing).

Several problems here:

1. Unlike your dataset, the tpcc dataset for the benchmark was not memory resident. Total dataset size was 786G. Just shrinking the dataset to fit in memory would substantially change the numbers.

2. The tpcc workload is much more complex than your benchmark. It doesn't make sense to compare a tpcc transaction, of which there are multiple flavors, to your workload.

3. All Clustrix write transactions are multi-node transactions. There's a 2x data redundancy in the test. We do not have a master-slave model for data distribution. Docs on our data distribution model: http://docs.clustrix.com/display/CLXDOC/Data+Distribution.


Now we've also done tests where the workload is a simple point updates and select on memory resident datasets. For those kinds of workloads, a 20 node cluster can do over 1M TPS.

What's up with the weird coding standards? Include files named infinisql_*.h and #line statements... strange.

Oh, the infinisql_*.h is because I deploy all header files as part of "make install", when what I really should do is boil it down to just the api header. The api is for stored procedure programming. Yes, I have it on backlog to fix. I give them all that name in case somebody installs to /usr/local (which you probably oughtn't) it's clear what application they all belong to. Yes, I could create a subdirectory, too. But the fix will be when I clean up api.cc to only have to pull in the one header instead of several of them.

#line statements because I get compiler messages from time to time putting things on the wrong line after having imported headers.

What is the difference to Teradata or Netezza except this is open source and lack the burden of universality yet?

Those are analytics databases, also known as data warehouses. Optimized for batch reporting. InfiniSQL is geared for operational/transactional (OLTP) kinds of workloads.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact