Hacker News new | past | comments | ask | show | jobs | submit login
Show HN: On-disk B+ tree for Python 3 (github.com)
159 points by nicolaslem on Jan 21, 2018 | hide | past | web | favorite | 21 comments

ZODB has a mature B+Tree implementation for on disk use in the BTrees package. https://pypi.python.org/pypi/BTrees

Nice, I'm a big user of various data stores for scientific work. How does this compare to LMDB (http://www.lmdb.tech/doc/), which uses B-trees (and has a Python interface)?

I haven't done any benchmark yet but I expect my implementation to be at least an order of magnitude slower.

I've found my implementation to be CPU intensive: creating Python objects from the raw pages is expensive. That's why bulk inserts and iterations are much faster than insert/get in a loop.

This is pretty cool. As I was reading the code I really wished for an explanation or simple ASCII diagram of the serialization formats for the various node/record types, as well as for the frames/wal format. Given that the poster is the author of the project, I hope you'll consider filling in these kinds of details, as they'd presumably be of interest to the people you're "Show"-ing this to.

Thank you. I know what you mean, I'm not happy with how the serialization is done right now, it's too complicated.

Maybe someone on HN knows a Python serialization library beyond pickle that would allow to describe how the data is laid out and take care of the rest. It looks like struct is not flexible enough for this usage.

The code looks very clean. Uses type annotations too. Wish there were more comments though.

Thank you! I found that type annotations make code easier to read so I tend to write less comments. But I'll make an effort.

How does this perform compared to sqlite3?

And and can you use keys that aren't builtin values?

It is possible to use your own key if it has a natural order and you write your own simple serializer: https://github.com/NicolasLM/bplustree/blob/master/bplustree...

Cannot delete items yet

Yes, it's a work in progress, see the `remove` branch.

That is a very dangerous name for a branch.

lol, truth

Why is this not mentioned very clearly in the README? Seems like willful misrepresentation.

You might also mention that, if replacing large values that use overflow pages, the file has the potential to grow without bounds as it looks like overflow pages are not collected?

Thanks for sharing, I've been interested in finding pure python implementations of storage engines. Another one I came across is whoosh: https://bitbucket.org/mchaput/whoosh/src/a16ebacb47191afaf2d...

Check out zodb and durus.

Are there any production data stores recommended for low memory usage? What should I use to stream data to disk and back with minimal overhead, preferably with indexed lookups?

An SSD with the highest random-read/write IOPS you can afford. Something like this:


Or FusionIO ioDrive2 Duo, which is above 900K IOPS for both reads and writes.

what about using a DB? Is there any really simple database for the same purpose?

What's wrong with LMDB?

Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact