I can't help but think that this announcement is 5 years too late. 5 years ago the field for write-optimized databases was wide open, and obviously there was a lot of demand for such a product; otherwise companies like Facebook and Powerset wouldn't have had to write their own versions.
But since there was no open source software available for this use case back then, they did write their own, which became Cassandra and HBase. And now these SSTable-type solutions have become the de facto standard for write-optimized databases.
If Tokutek had open sourced TokuDB from the very beginning, things could have turned out very differently. TokuDB had a huge advantage over Cassandra/HBase in being SQL-oriented and based on MySQL, whereas Cassandra/HBase had a much steeper learning curve. But since TokuDB was proprietary, it never really gained much traction.
One big differentiating feature between TokuDB and other write-optimized databases is that TokuDB is fully transactional: it is ACID compliant. There is a demand for this.
AFAIK, TokuDB is the only write-optimized transactional data store. We (I work at Tokutek) are working on integrating with MongoDB as we speak, and when we do, we will bring transactional semantics to MongoDB in addition to improved performance.
So, I think there is still a big demand out there that TokuDB will be able to meet.
My impression was that TokuDB is a single-server database, as a backend for MySQL. In that case it wouldn't be a direct competitor to HBase and Cassandra, both of which you normally wouldn't even consider unless you expect to have a cluster of machines.
You'd arguably be taking a risk by choosing a centralized database for write-heavy applications, which are the hardest to scale, but I could see it as a good fit for the applications where people currently use sharded MySQL.
Cassandra and HBase shard too; they just shard differently, since sharding is baked into the system while in MySQL it's not. When you get right down to it, if you shard on Toku, it will simply be more efficient at the storage level.
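To make "not baked in" concrete, here is a rough sketch of the routing an application has to do itself when it shards MySQL by hand. The host names and the `shard_for` / `host_for` helpers are made up for illustration; this is just one common hash-based scheme, not anything specific to TokuDB, Cassandra, or HBase.

```python
import hashlib

# Hypothetical shard hosts; each would run its own MySQL instance.
SHARDS = [
    "mysql-shard-0.example",
    "mysql-shard-1.example",
    "mysql-shard-2.example",
    "mysql-shard-3.example",
]


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard number with a stable hash.

    Python's built-in hash() is randomized per process for strings,
    so a cryptographic digest is used to keep the mapping stable.
    """
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def host_for(key: str) -> str:
    """Pick the host the application should send this key's queries to."""
    return SHARDS[shard_for(key, len(SHARDS))]
```

Plain modulo hashing like this remaps most keys whenever the shard count changes, which is part of what the clustered systems handle for you.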
The main technical underpinning (though note this isn't my area) seems to be basing indices on a particular improvement to B-trees, Cache-Oblivious Streaming B-Trees, which have some nice performance characteristics. I think what TokuDB calls "fractal trees" are just a catchier synonym.
Does this mean that it's now safe for folks to write their own versions of the data structures that Tokutek has patents on? They hold patents on some of the more natural choices in cache-oblivious B-trees (admittedly they also invented those retrospectively natural choices).
Oh what a world. A data structure should never be patentable - what's the difference between a data structure and a math formula? Both are just notations for reality, and can spring up in the minds of many disparate inventors.
I'd really like to see TokuDB discuss its patents, why they have them, and what they intend to do with them. I'd like to see a page like that from every corporation on their website actually.
"A data structure should never be patentable - what's the difference between a data structure and a math formula? Both are just notations for reality, and can spring up in the minds of many disparate inventors."
Can we please let this disingenuous argument die already? Yes, in theory they can spring up in many minds independently. In practice they rarely do unless they are trivial. If the alleged disparate inventors can prove they came up with it independently, fine; the onus is on them.
That's absolutely true, it's far too easy to get in a situation like that even without considering patent law. I just think patents make the scatter-gun approach too accessible, and the damages are far too high. Software doesn't need to protect R&D costs on FDA studies and years of testing. If the goal is to foster innovation, either software patents should be abolished, or there should be very reasonable limits to damages, or there should be a simple and inexpensive way to trigger a review that invalidates ridiculously obvious patents such as "scan-to-email" and friends.
I'm unfamiliar with this particular implementation, but it sounds (from your one sentence description) kinda like Dancing Trees, created by Hans Reiser, for ReiserFS. Would this be prior art, or would this be different approaches to similar problems?
Yes and no. I think they're doing a variant of cache-oblivious buffered B-epsilon trees. I'll have to go code spelunking, now that the source is available. (Boy, am I glad it's GPLv2 and not some (A)GPLv3 variant.)
For the patent side of things, it being GPLv2 is sort of unfortunate compared to GPLv3, because v3 contains a patent grant, whereas v2 doesn't. It's possible some kind of implicit patent grant could be read into it by a court, e.g. that by open-sourcing some software, a company is making an open offer to use/modify the software, and then if they turned around and sued you for doing so, some equitable doctrine like estoppel would stand in the way. But an explicit patent grant is a lot clearer.
For companies wary of v3 for other reasons, I wonder if there's an easy/semi-standardized way to tack on a patent grant? I'd feel much safer using open-source software licensed under some kind of "MIT + patent grant" or "GPLv2 + patent grant" license than the vanilla versions.
If you are interested in this, you should look at our source distribution's license file. We have done the work to get a patent license from MIT, Stony Brook, and Rutgers so that the users of the GPL'd code can use the fractal tree patents.
Indemnification is a completely different issue: no open source software that I know of provides any kind of indemnification which would protect you from a lawsuit by a third party.
Ah excellent, sorry for missing that. Yeah, I was just looking for a patent grant, not indemnification. I saw the notice of GPLv2, but hadn't seen the additional terms. If anyone else is looking for those, they're located here (possibly among other places): https://github.com/Tokutek/ft-index/blob/master/README-TOKUD... (ctrl+f "PATENT RIGHTS GRANT").
Hmmm, after reading the simple explanation PDF of Fractal Trees, they sound more like Dancing Trees than I'd assumed, but I am a poorly informed outsider. The key feature, from a performance perspective, is that writes are minimized by making each write do more, though both also help with SSD wear as a side effect. So, from the outside, it looks like both accomplish many of the same goals, and they do it by fundamentally altering B-tree implementation with modern hardware in mind.
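For anyone who wants the "make each write do more" idea in code, here is a toy sketch. It is only an illustration of buffering writes in front of leaf blocks, under my own assumptions; the `BufferedIndex` class, its fixed pivots, and the `leaf_writes` counter are all hypothetical and are not how TokuDB or ReiserFS are actually implemented.

```python
import bisect


class BufferedIndex:
    """Toy two-level index: one root buffer in front of several leaf blocks.

    Hypothetical illustration only. Real write-optimized trees have many
    levels, node splits, and on-disk layouts that are omitted here.
    """

    def __init__(self, pivots, buffer_size):
        self.pivots = pivots                  # sorted keys splitting the key space
        self.leaves = [dict() for _ in range(len(pivots) + 1)]
        self.buffer = {}                      # pending writes held in the root
        self.buffer_size = buffer_size
        self.leaf_writes = 0                  # simulated block writes to storage

    def put(self, key, value):
        self.buffer[key] = value              # cheap: only touches the root buffer
        if len(self.buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        # Group pending writes by destination leaf, then write each leaf once.
        batches = {}
        for key, value in self.buffer.items():
            leaf_no = bisect.bisect_right(self.pivots, key)
            batches.setdefault(leaf_no, {})[key] = value
        for leaf_no, batch in batches.items():
            self.leaves[leaf_no].update(batch)
            self.leaf_writes += 1             # one block write carries many keys
        self.buffer.clear()

    def get(self, key):
        if key in self.buffer:                # newest value may still be buffered
            return self.buffer[key]
        leaf_no = bisect.bisect_right(self.pivots, key)
        return self.leaves[leaf_no].get(key)
```

With `BufferedIndex(pivots=[100, 200, 300], buffer_size=64)`, inserting 400 keys costs far fewer than 400 simulated leaf writes, because each flush carries a batch of keys to a leaf in one go. An unbuffered tree would touch a leaf on every insert.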
But, this isn't my area. I only have a passing understanding of Dancing Trees (my previous business led me to following ReiserFS closely, but I haven't paid any attention to filesystem or database performance in 7+ years), and no real understanding of Fractal Trees. I'm glad it's Open Source, however, and thanks for the pointers.
Dancing Trees are algorithmically the same as B+ trees, if I understand the Wikipedia article correctly. Given a uniformly random workload with a large enough working set, one would behave pretty much exactly the same as a B+ tree.
Well ... I think it's actually pretty good bug reporting habit. Was it text/plain or unstyled text/html? Was the body an error message but a styled header and footer present? You might want to copy and paste the text as well, to aid future copy and paste efforts, but it's surprisingly frequently useful to have a screenshot.