
There are several use cases where this is a sure-fire way of shooting yourself in the foot:

* if you have many records (i.e. more than a couple hundred), the file system will have a lot of work to do and the whole thing becomes sluggish

* if you want to query the data by content, there's nothing that gives you sublinear search capability here

* it's not easy to modify data under this scheme. If you add that capability, you get the familiar choice between race conditions and added complexity. Having said that, if you never modify the data, you can also drop the store entirely and pass around the data itself (encrypted if you need that) instead of a key.

As an alternative to this, consider each process appending to a file and keeping filename+offset as the identifier for a particular record. This solves at least the "too many files" problem.
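Roughly something like this (a sketch only; it assumes one writer per file, and a real version would need fsync and crash-recovery thought):

    import json

    def append_record(log_path, record):
        # Append one JSON record per line; (filename, offset) identifies it.
        line = (json.dumps(record) + "\n").encode("utf-8")
        with open(log_path, "ab") as f:
            offset = f.tell()  # with a single writer per file this is the record's start
            f.write(line)
        return (log_path, offset)

    def read_record(log_path, offset):
        with open(log_path, "rb") as f:
            f.seek(offset)
            return json.loads(f.readline())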

Or, if you only need to read a static collection, put your JSON (or some moral equivalent, e.g. msgpack) into a CDB database: http://cr.yp.to/cdb.html
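For instance, a rough sketch (the records dict is made up; this just emits cdbmake's "+klen,dlen:key->data" input format, which you then pipe through the cdbmake tool that ships with cdb):

    import json
    import sys

    # Made-up example data: record id -> JSON-serializable dict.
    records = {
        "user:1": {"name": "alice", "score": 10},
        "user:2": {"name": "bob", "score": 7},
    }

    # cdbmake reads "+klen,dlen:key->data" lines, terminated by a blank line.
    out = sys.stdout.buffer
    for key, value in records.items():
        k = key.encode("utf-8")
        d = json.dumps(value).encode("utf-8")
        out.write(b"+%d,%d:%s->%s\n" % (len(k), len(d), k, d))
    out.write(b"\n")

Build the database with something like: python make_input.py | cdbmake records.cdb records.tmp — after which any cdb binding gives you fast hashed lookups by key.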

Next step up: use LevelDB, or KyotoCabinet/KyotoTycoon to organize the storage.
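E.g. with LevelDB through the plyvel Python binding (a sketch; the path and keys are made up):

    import json
    import plyvel  # third-party LevelDB binding: pip install plyvel

    # Open (or create) a LevelDB database in a local directory.
    db = plyvel.DB("./records.ldb", create_if_missing=True)

    # Store each record as a JSON blob under a byte-string key.
    db.put(b"user:1", json.dumps({"name": "alice", "score": 10}).encode("utf-8"))

    # Point lookups are sublinear, and iteration comes back in key order.
    print(json.loads(db.get(b"user:1")))

    db.close()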



Whether a large number of files becomes sluggish really depends on your file system. In any case, a common technique is to break large numbers of files into subfolders, which usually solves this problem reasonably well.
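Something along these lines (a sketch; the two-hex-character fan-out and the names are made up):

    import hashlib
    import json
    import os

    STORE_ROOT = "./store"  # illustrative root directory

    def shard_path(key):
        # Fan files out into up to 256 subfolders keyed by the first two hex
        # characters of the key's SHA-1, so no single directory grows huge.
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        return os.path.join(STORE_ROOT, digest[:2], key + ".json")

    def put(key, record):
        path = shard_path(key)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            json.dump(record, f)

    def get(key):
        with open(shard_path(key)) as f:
            return json.load(f)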

As for updating, flock[0] solves this issue on operating systems which support it.

[0] http://linux.die.net/man/2/flock
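As a concrete sketch, using Python's fcntl.flock wrapper over that syscall (the read-modify-write logic here is just illustrative):

    import fcntl
    import json
    import os

    def update_record(path, mutate):
        # Open read/write, creating the file if it doesn't exist yet.
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        with os.fdopen(fd, "r+") as f:
            # Exclusive advisory lock; concurrent updaters block here.
            fcntl.flock(f.fileno(), fcntl.LOCK_EX)
            raw = f.read()
            record = json.loads(raw) if raw.strip() else {}
            mutate(record)
            f.seek(0)
            f.truncate()
            json.dump(record, f)
            # Lock is released when the file is closed.

    # e.g. update_record("store/user-1.json",
    #                    lambda r: r.update(visits=r.get("visits", 0) + 1))

Keep in mind flock is advisory, so every process touching the store has to take the lock for this to help.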


Usenet news and maildir are cases where current operating systems already have to cope with that kind of load, so it's definitely possible.

The question is whether this can be useful without turning into a partial, bug-ridden reimplementation of a NoSQL database, given that we already have NoSQL databases that fit the bill and carry lower maintenance costs than a spit-and-glue solution.


ReiserFS (v3) is a great small filesystem that's fantastic with lots of small files (and has also coped well with power-outage events on my laptop for the last 15 years). I've had tons of issues with ext3/4 (running out of extents, slow performance on lots of small files), btrfs (running out of metadata space while I still had hundreds of GB left?!), and xfs (great at everything except lots of tiny files and power loss). ReiserFS even supported reliable shrinking and growing on LVM.

It's too bad no one supports it anymore: the founder is in prison, and the few other people able to maintain it seem focused on a Reiser4 pipe dream instead of on great, reliable technology that had most of its bugs worked out a long time ago.


> if you have many records (i.e. more than a couple hundred), the file system will have a lot of work to do and the whole thing becomes sluggish

On OS X, running the simple test that creates, reads, and deletes 1,000 documents is no problem. The bigger concern is hitting the filesystem's inode limit or the per-directory file limit. That can be worked around by "sharding" into sub-directories based on the first character of the UUID.
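A rough version of that test (a sketch; the 1,000-document figure is from the comment above, everything else is made up):

    import json
    import os
    import shutil
    import time
    import uuid

    ROOT = "./bench_store"  # scratch directory

    def doc_path(doc_id):
        # Shard by the first character of the UUID so no directory holds all the files.
        return os.path.join(ROOT, doc_id[0], doc_id + ".json")

    ids = [str(uuid.uuid4()) for _ in range(1000)]

    start = time.time()
    for doc_id in ids:
        path = doc_path(doc_id)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            json.dump({"id": doc_id, "body": "hello"}, f)
    for doc_id in ids:
        with open(doc_path(doc_id)) as f:
            json.load(f)
    for doc_id in ids:
        os.remove(doc_path(doc_id))
    print("create+read+delete 1,000 docs: %.3fs" % (time.time() - start))

    shutil.rmtree(ROOT)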


WiredTiger is faster than LevelDB in my experience.


Interesting, hadn't heard of it. Thanks!



