Hacker News
TokuDB open sourced (tokutek.com)
89 points by porker 1254 days ago | 52 comments

I can't help but think that this announcement is 5 years too late. Five years ago the field for write-optimized databases was wide open, and there was obviously a lot of demand for such a product; otherwise companies like Facebook and Powerset wouldn't have had to write their own versions.

But since there was no open source software available for this use case back then, they did write their own, which became Cassandra and HBase. And now these SSTable-type solutions have become the de facto standard for write-optimized databases.

If Tokutek had open sourced TokuDB from the very beginning, things could have turned out very differently. TokuDB had a huge advantage over Cassandra/HBase in being SQL-oriented and based on MySQL, whereas Cassandra/HBase had a much steeper learning curve. But since TokuDB was proprietary, it never really gained much traction.

One big differentiating feature between TokuDB and other write-optimized databases is that TokuDB is fully transactional and ACID compliant. There is demand for this.

AFAIK, TokuDB is the only write-optimized transactional data store. We (I work at Tokutek) are working on integrating with MongoDB as we speak, and when we do, we will bring transactional semantics to MongoDB in addition to improved performance.

So, I think there is still a big demand out there that TokuDB will be able to meet.

My impression was that TokuDB is a single-server database, as a backend for MySQL. In that case it wouldn't be a direct competitor to HBase and Cassandra, both of which you normally wouldn't even consider unless you expect to have a cluster of machines.

You'd arguably be taking a risk by choosing a centralized database for write-heavy applications, which are the hardest to scale, but I could see it as a good fit for the applications where people currently use sharded MySQL.

Cassandra and HBase shard too; they just shard differently, since sharding is baked into the system, while in MySQL it's not. When you get right down to it, if you shard on Toku, it will simply be more efficient at the storage level.

> Tokutek, a leader in high-performance and agile database storage engines....

"Database Error: Error establishing a database connection"


TokuDB is great technology, glad to see them embracing open source. Too bad their WordPress installation needs some tuning!

The site is down. What is great about this technology?

The main technical underpinning (though note this isn't my area) seems to be basing indices on a particular improvement to B-trees, Cache-Oblivious Streaming B-Trees, which have some nice performance characteristics. I think what TokuDB calls "fractal trees" are just a catchier synonym.

Here's an academic paper from a few years ago from some of the people involved: http://supertech.csail.mit.edu/papers/sbtree.pdf

And a more recent talk focused on TokuDB: http://www.bnl.gov/csc/seminars/abstracts/Bender_Presentatio...

We have some more descriptions of Fractal Tree Indexes at http://tokutek.com/what-is-a-fractal-tree, and I'm always happy to answer questions about it.

For starters, it totally changes the database performance paradigm for managing tables with many indexes.

"Error establishing a database connection" - Irony.

Shit happens - to everyone, even DB guys. Even with DB's. And I'd say more often with DB's to DB guys than with DB's to anyone else ;)

We're trying to contact the website folk but they're mighty busy announcing things right now. I'll let you all know what happened soon. In the meantime, here's a press release you can read: http://www.marketwire.com/press-release/tokutek-meets-big-da...

Also, we're on github: http://github.com/Tokutek and on IRC at #tokutek on irc.freenode.net, so come hang out!

Sadly, we don't use TokuDB to run our web site. It's some relatively inexpensive service.

Does this mean that it's now safe for folks to write their own versions of the data structures that Tokutek has patents on? They hold patents on some of the more natural choices in cache-oblivious B-trees (admittedly they also invented those retrospectively natural choices).

Oh what a world. A data structure should never be patentable - what's the difference between a data structure and a math formula? Both are just notations for reality, and can spring up in the minds of many disparate inventors.

I'd really like to see TokuDB discuss its patents, why they have them, and what they intend to do with them. I'd like to see a page like that from every corporation on their website actually.

"A data structure should never be patentable - what's the difference between a data structure and a math formula? Both are just notations for reality, and can spring up in the minds of many disparate inventors."

Can we please let this disingenuous argument die already? Yes, in theory they can spring up in many minds independently. In practice they rarely do unless they are trivial. If the alleged disparate inventors can prove they came up with it independently, fine; the onus is on them.

The onus placed on them is too great when it appears as a dilemma between a yearly $X0,000 extortion or one-time $X,000,000 lawsuit.

Fair point, though the lawsuit cost insanity is hardly limited to patents.

That's absolutely true, it's far too easy to get in a situation like that even without considering patent law. I just think patents make the scatter-gun approach too accessible, and the damages are far too high. Software doesn't need to protect R&D costs on FDA studies and years of testing. If the goal is to foster innovation, either software patents should be abolished, or there should be very reasonable limits to damages, or there should be a simple and inexpensive way to trigger a review that invalidates ridiculously obvious patents such as "scan-to-email" and friends.

In practice, independent invention happens all the time, and we end up with absurd patent lawsuits as a result.

AFAIK none of these ludicrous patents are about data structures, algorithms or math formulas.

I find that somewhat difficult to accept, given that pretty much every single patent that touches software is ludicrous.

Can someone in the know elaborate on the current state of US patent law here? And what exactly TokuDB / the universities hold a patent on?

Curious outsider here, and understand that all IANAL caveats apply.

IANAL, but the patents are owned by the universities where the founders work, not by the company. I don't know if this makes you feel any better though.

I'm unfamiliar with this particular implementation, but it sounds (from your one sentence description) kinda like Dancing Trees, created by Hans Reiser, for ReiserFS. Would this be prior art, or would this be different approaches to similar problems?


Fractal Tree indexes are very different from Dancing Trees. See http://tokutek.com/what-is-a-fractal-tree for an overview, and I'd be happy to answer questions if you have them.

Hmmm, after reading the simple explanation PDF of Fractal Trees, they sound more like Dancing Trees than I'd assumed, but I am a poorly informed outsider. The key feature, from a performance perspective, is that writes are minimized by making each write do more, though both also help with SSD wear as a side effect. So, from the outside, it looks like both accomplish many of the same goals, and they do it by fundamentally altering B-tree implementation with modern hardware in mind.

But, this isn't my area. I only have a passing understanding of Dancing Trees (my previous business led me to following ReiserFS closely, but I haven't paid any attention to filesystem or database performance in 7+ years), and no real understanding of Fractal Trees. I'm glad it's Open Source, however, and thanks for the pointers.

Dancing Trees are algorithmically the same as B+ trees, if I understand the Wikipedia article correctly. With a uniformly random workload and a large enough working set, one would behave pretty much exactly the same as a B+ tree.

Yes and no. I think they're doing a variant of cache-oblivious buffered B-epsilon trees. I'll have to go code spelunking now that the source is available. (Boy am I glad it's GPLv2 and not some (A)GPLv3 variant.)
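For anyone who wants to see the buffering idea in miniature, here is a toy Python sketch. It is an assumption-laden illustration of write buffering in general, not TokuDB's fractal tree code and not a real B-epsilon tree: the class name `BufferedTree`, the fixed pivots, the single root buffer, and the `leaf_writes` counter are all made up for the example. A real structure has buffers at every internal node and typically flushes to one child at a time.

```python
import bisect


class BufferedTree:
    """Toy two-level write-buffered tree (illustrative only)."""

    def __init__(self, pivots, buffer_capacity=4):
        self.pivots = sorted(pivots)  # keys that separate the leaves
        self.leaves = [dict() for _ in range(len(self.pivots) + 1)]
        self.buffer = []  # pending (key, value) messages held at the root
        self.buffer_capacity = buffer_capacity
        self.leaf_writes = 0  # simulated count of leaf I/Os

    def _leaf_index(self, key):
        # Index of the leaf whose key range contains `key`.
        return bisect.bisect_right(self.pivots, key)

    def insert(self, key, value):
        # An insert only touches the in-memory buffer; leaves are
        # written when the buffer fills up.
        self.buffer.append((key, value))
        if len(self.buffer) >= self.buffer_capacity:
            self.flush()

    def flush(self):
        # Group pending messages by destination leaf, then apply each
        # group with a single simulated leaf write.
        batches = {}
        for key, value in self.buffer:
            batches.setdefault(self._leaf_index(key), []).append((key, value))
        for index, batch in batches.items():
            self.leaves[index].update(batch)  # later messages win
            self.leaf_writes += 1
        self.buffer.clear()

    def get(self, key, default=None):
        # Reads must check the buffer first: the newest message wins.
        for pending_key, pending_value in reversed(self.buffer):
            if pending_key == key:
                return pending_value
        return self.leaves[self._leaf_index(key)].get(key, default)
```

With `buffer_capacity=4`, four inserts that land in the same leaf cost one simulated leaf write instead of four, which is the "each write does more" effect discussed upthread. The price is that reads have to consult the buffer as well as the leaf.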

Thanks for the interesting reference!

For the patent side of things, it being GPLv2 is sort of unfortunate compared to GPLv3, because v3 contains a patent grant, whereas v2 doesn't. It's possible some kind of implicit patent grant could be read into it by a court, e.g. that by open-sourcing some software, a company is making an open offer to use/modify the software, and then if they turned around and sued you for doing so, some equitable doctrine like estoppel would stand in the way. But an explicit patent grant is a lot clearer.

For companies wary of v3 for other reasons, I wonder if there's an easy/semi-standardized way to tack on a patent grant? I'd feel much safer using open-source software licensed under some kind of "MIT + patent grant" or "GPLv2 + patent grant" license than the vanilla versions.

If you are interested in this, you should look at our source distribution's license file. We have done the work to get a patent license from MIT, Stony Brook, and Rutgers so that the users of the GPL'd code can use the fractal tree patents.

Indemnification is a completely different issue: no open source software that I know of provides any kind of indemnification which would protect you from a lawsuit by a third party.

-Bradley (bradley@tokutek.com)

Ah excellent, sorry for missing that. Yeah, I was just looking for a patent grant, not indemnification. I saw the notice of GPLv2, but hadn't seen the additional terms. If anyone else is looking for those, they're located here (possibly among other places): https://github.com/Tokutek/ft-index/blob/master/README-TOKUD... (ctrl+f "PATENT RIGHTS GRANT").

Could you elaborate on the meaning of the phrase "this implementation"?

E.g., if I write some Apache 2 licensed Haskell code that can interact seamlessly with the rest of the storage engine implementation, and I swap that part in, is that still the same implementation?

Eeek. Didn't realize there was that delta on patents between v2 and v3. Was aware of that problem with MIT/BSD though. Apache 2 kinda is the MIT/BSD with explicit patent grants, I guess.

I suppose that indemnification from code patent issues is probably part of the commercial license / product then?

Link to source code: https://github.com/tokutek

Was quite looking forward to reading this, but when I clicked the link, I was faced with: http://imgur.com/wwEogpx I think there's a certain irony seeing as this is a database related announcement!

You took a screenshot of plain text. Did "Error establishing database connection" not suffice?

Well ... I think it's actually pretty good bug reporting habit. Was it text/plain or unstyled text/html? Was the body an error message but a styled header and footer present? You might want to copy and paste the text as well, to aid future copy and paste efforts, but it's surprisingly frequently useful to have a screenshot.

LOL "Error establishing a database connection"

OK, we are all learning first hand here about the "Slashdot effect" http://en.wikipedia.org/wiki/Slashdot_effect (now the Hacker News effect).

We are working to get the site back up (a simple WordPress site set up by the marketing department that does not use our database). In the meantime, please take a look at http://mwne.ws/124r2GL

Thanks for all the comments. We clearly didn't anticipate this level of excitement, but are grateful for the interest.

And our Google Groups: tokudb-dev and tokudb-user.

Ok, the website is starting to come back online after our huge spike in traffic.

Thanks for all the interest and for your patience...


Anyone have a cached page?

Any plans to have this as an indexing option for PostgreSQL?

Is there any testimonial using TokuDB?

Hope this helps, from Tokutek's website: http://www.tokutek.com/solutions/

Edit: oops, I followed the parent's link to http://www.tokutek.com/products/tokudb-for-mysql/, which says:

"Tools such as Hot Backup (coming soon) allow a backup to be completed while database is running."

- Does that mean that someone using the open source version would be unable to take a backup of a running database?

Someone using the open source version can take backups just as they would with our versions prior to 7.0:

- snapshots (LVM, EBS, etc...)

- cold backups

- mysqldump (with MVCC, this is technically hot)
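To make the mysqldump option concrete, here is a minimal Python sketch that builds and runs the command. `--single-transaction`, `--all-databases`, `--user`, and `--result-file` are standard mysqldump options; the function names, the `backup_user` account, and the output path are hypothetical. Whether the dump is a consistent snapshot on TokuDB tables rests on the MVCC behavior described above, so treat that as an assumption to verify on your own setup.

```python
import subprocess


def build_mysqldump_command(output_path, user="backup_user"):
    """Return the argv for a dump taken inside a single transaction.

    `backup_user` is a placeholder account name, not a real default.
    """
    return [
        "mysqldump",
        "--single-transaction",  # dump from one transaction's snapshot
        "--all-databases",
        f"--user={user}",
        f"--result-file={output_path}",
    ]


def run_backup(output_path, user="backup_user"):
    # Raises CalledProcessError if mysqldump exits with a non-zero status.
    subprocess.run(build_mysqldump_command(output_path, user), check=True)
```

Calling `run_backup("/tmp/all_databases.sql")` on the database host would write the dump to that file while the server keeps serving queries.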
