

Ask HN: tutorials on using the file system for storing data? - Tichy

I have to admit it, I am always extremely scared of simply using the file system for storing data. What if something happens in the middle of a write operation? How to avoid data corruption?

Another thing is I am not sure how to write data effectively, or is that covered with random access files? Like say some data item in the middle of the file changes, can I just change it on the spot?

Just saying, I am clueless about this. Maybe there are some tutorials for people who want to understand this alternative to using databases?
======
Hoff
For transactional file systems and transactional databases, the lower-level
software deals with this for you. This holds whether it is a transactional SQL
database, Mac OS X Core Data and its undo, or something else. ACID is goodness.

For non-transactional databases (and non-transactional file systems in
general), look at the concept of "careful writes". At its simplest, you
allocate, build, read, and write your new structures outside of the live
application data structures, and only add them into the static storage
with a single-block or other canonical write as the last step of the
update or change. That way you always avoid having inconsistent structures.
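
One common way to get this "publish as the last step" effect on an ordinary file system is to write the new version to a temporary file, fsync it, and then rename it over the live file. This is a minimal sketch of that write-then-rename pattern, not necessarily the exact scheme described above; the function name `atomic_write` and the `.tmp-` prefix are made up for illustration, and the atomicity of the rename is assumed from POSIX semantics.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` so readers see the old or the new content, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # Build the new version outside the live file, in the same directory so
    # the final rename stays within one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to the device first
        os.replace(tmp_path, path)  # the single publishing step
    except BaseException:
        # Failed before publishing: release the dangling temporary file.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    if os.name == "posix":
        # Best effort: sync the directory so the rename itself survives a crash.
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

A crash before `os.replace` leaves the old file untouched plus a stray temporary file; a crash after it leaves the new file. Either way the live file is never half-written.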

In the event of an application or system crash, you may (will?) need a cleanup
daemon that finds and releases any dangling allocations.
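
As a rough illustration of such a cleanup pass, the sketch below assumes that in-progress writes live in temporary files with a known prefix and that anything older than a threshold was abandoned by a crashed writer. The function name, the `.tmp-` prefix, and the one-hour default are all hypothetical choices, not something from a particular system.

```python
import os
import time


def sweep_stale_temp_files(directory: str, prefix: str = ".tmp-",
                           max_age_seconds: float = 3600.0) -> list[str]:
    """Delete leftover temporary files older than max_age_seconds.

    Returns the sorted names of the files that were removed.
    """
    removed = []
    now = time.time()
    with os.scandir(directory) as it:
        entries = list(it)  # snapshot first, then delete
    for entry in entries:
        if not entry.is_file() or not entry.name.startswith(prefix):
            continue  # live data is never touched by the sweep
        age = now - entry.stat().st_mtime
        if age >= max_age_seconds:
            os.unlink(entry.path)
            removed.append(entry.name)
    return sorted(removed)
```

The age threshold is there so the sweep does not delete a temporary file that a healthy writer is still filling in.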

And one that threw me: there are cases where multiblock writes might not see
all blocks written. Some storage devices might either cache the data, or might
(due to a power failure) not write all blocks.
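
One way to detect a partially written record after the fact is to frame each record with its length and a checksum, and to stop reading at the first record that fails the check. The sketch below is an assumption about how that could look, using CRC32 from Python's standard library; the record layout and the function names are invented for the example.

```python
import os
import struct
import zlib

# Hypothetical record header: payload length and CRC32 of the payload.
HEADER = struct.Struct("<II")


def append_record(path: str, payload: bytes) -> None:
    """Append one length-prefixed, checksummed record to the file."""
    with open(path, "ab") as f:
        f.write(HEADER.pack(len(payload), zlib.crc32(payload)) + payload)
        f.flush()
        os.fsync(f.fileno())


def read_valid_records(path: str) -> list[bytes]:
    """Return every record up to (not including) the first damaged one."""
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(HEADER.size)
            if len(header) < HEADER.size:
                break  # clean end of file, or a torn header
            length, checksum = HEADER.unpack(header)
            payload = f.read(length)
            if len(payload) < length or zlib.crc32(payload) != checksum:
                break  # torn or corrupted record: ignore it and the rest
            records.append(payload)
    return records
```

A checksum does not prevent the torn write, it only lets the reader notice it and fall back to the last record known to be good.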

You'll find various discussions and papers on "careful writes" around. And
ACID. And related.

Once you get the hang of this sequencing, the next level of complexity upwards
here can involve distributed access and coordinating and sequencing write
operations. This can involve locking or queuing.
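
For the simpler single-machine case, cooperating processes can sequence their writes with an advisory file lock. This sketch uses `fcntl.flock`, which exists only on Unix-like systems and only protects against other processes that also take the lock; it does not cover the distributed case. The function name is made up for the example.

```python
import fcntl
import os


def append_line_locked(path: str, line: str) -> None:
    """Append one line while holding an exclusive advisory lock on the file."""
    with open(path, "a", encoding="utf-8") as f:
        fcntl.flock(f.fileno(), fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            f.write(line + "\n")
            f.flush()
            os.fsync(f.fileno())
        finally:
            fcntl.flock(f.fileno(), fcntl.LOCK_UN)
```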

------
ajross
Generally, sane file metaphors try very hard _not_ to change stuff in the
middle of the file. Try to keep with one "record" (which might be a bigger
object than a single row in a database!) per file, and read/write them all in
one swoop. If you need indexes other than the one you get for free (the path
name, of course), keep them in separate files and make sure your toolchain can
rebuild them as needed.
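
A minimal sketch of that layout, assuming JSON records and a made-up directory structure: each record is one file named after its id, and a secondary index is a separate, disposable file that can always be regenerated by scanning the records. All function names here are hypothetical.

```python
import json
import os


def save_record(root: str, record_id: str, record: dict) -> None:
    """Write one record per file; the path name is the primary index."""
    with open(os.path.join(root, record_id + ".json"), "w", encoding="utf-8") as f:
        json.dump(record, f)


def load_record(root: str, record_id: str) -> dict:
    """Read a whole record back in one go."""
    with open(os.path.join(root, record_id + ".json"), encoding="utf-8") as f:
        return json.load(f)


def rebuild_index(root: str, field: str, index_path: str) -> dict[str, list[str]]:
    """Rebuild a secondary index (field value -> record ids) from the records."""
    index: dict[str, list[str]] = {}
    for name in sorted(os.listdir(root)):
        if not name.endswith(".json"):
            continue
        record_id = name[: -len(".json")]
        value = load_record(root, record_id).get(field)
        if value is not None:
            index.setdefault(str(value), []).append(record_id)
    # The index lives in its own file and is safe to delete and regenerate.
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump(index, f)
    return index
```

Because the records are the only source of truth, a lost or stale index costs a rescan rather than data.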

The filesystem has looser rules about data loss than the database will by
default, so unless you want to handle this yourself (which can be done), you
should probably turn on "ordered data" journalling in your filesystem (e.g.
for ext3: "mount -o data=ordered"). Then you only need to be able to recover
from a crash (or killing) of your own software, which is a problem you'll have
to handle anyway.

The classic application for this sort of technique is a mail server. Software
like sendmail and postfix has been doing reliable on-disk storage for decades
now, and they don't have to jump through too many hoops to do it.

------
bayareaguy
I would recommend reading Transaction Processing: Concepts and Techniques by
Gray and Reuter, particularly the sections about ways to construct Atomic
commits.

SQLite's atomic commit design is also good food for thought -
<http://www.sqlite.org/atomiccommit.html>

However, while it's good to think about low-level consistency (e.g. disk
writes) I think you're better off spending your time on your application's
high-level consistency (e.g. procedures for identifying inconsistencies,
logging your changes, recovering from backups). Your OS developers have
already spent countless hours on the issue of getting your blocks to disk, but
you may be the only one who has spent any time thinking about what things are
important to preserve in your application.

------
Maro
Maybe you want something like Berkeley DB, a.k.a. libdb.

It's a library that you link to your application which exposes a database API
for storing and retrieving records in tables (which are stored in files). No
DB server is involved, and since you use the programmatic API, there is no
need for SQL (and it's not supported).
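
To give a feel for this style of API, here is a small sketch using Python's standard-library `dbm` module. It is not Berkeley DB itself, and which on-disk backend it picks depends on the Python installation, but the shape is similar: an embedded key/value store kept in files, no server, no SQL. The `put` and `get` helpers are made up for the example.

```python
import dbm
from typing import Optional


def put(path: str, key: bytes, value: bytes) -> None:
    """Store one key/value pair, creating the database files if needed."""
    with dbm.open(path, "c") as db:
        db[key] = value


def get(path: str, key: bytes) -> Optional[bytes]:
    """Return the stored value, or None if the key is absent."""
    with dbm.open(path, "c") as db:
        return db[key] if key in db else None
```

Usage is just `put("store", b"user:1", b"alice")` followed by `get("store", b"user:1")`; keys and values are plain bytes, so any record encoding is up to the application.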

Reference guide:
<http://www.oracle.com/technology/documentation/berkeley-db/db/ref/toc.html>

------
elad
I started using the filesystem for storage a while back, and then realized
that I'm re-implementing a lot of DBMS functionality in my own buggy code. I
switched over to using a database.

I'm just saying that you should consider all of your requirements first,
figure out how much code you're going to need that just manages your data, and
then decide whether a database is really such a bad idea.

~~~
Tichy
I have no problems with databases, but anti-db articles crop up on Hacker
News every now and then. I was wondering if my fixation on db's would hamper
my progress...

------
yaj
Among other document-oriented databases mentioned, also check out CouchDB
<http://incubator.apache.org/couchdb/>

------
jsjenkins168
Have you considered using Amazon SimpleDB? It's a very simple and fast way to
persist data. Avoids the scaling and implementation headaches associated with
a typical database.

~~~
Tichy
I did not really have headaches with databases yet. One issue might be not
having database drivers, though (LISP programming???). Another might be use
cases that don't match databases well. For example I could imagine search
engines would be better off saving their data in a different way (specialized
index files). Or twitter-like things?

~~~
jdale27
"One issue might be not having database drivers, though (LISP
programming???)."

Huh?

~~~
Tichy
I know there are probably drivers, I just have not figured it all out yet
(MzScheme). What about orm mappers, for example?

Sorry, that really wasn't meant as a criticism of LISP.

~~~
jdale27
Well, I'm not too familiar with MzScheme, but <http://planet.plt-scheme.org/>
would probably be a good place to look.

Common Lisp certainly does have good database support: CLSQL probably supports
whatever combination of OS, Lisp implementation, and RDBMS you need.

ORM tools and other object persistence solutions are there too.
<http://common-lisp.net/> is a good source.

------
typedlambda
Consider using SQLite, which was made to be an almost universal file format
that also allows transactions. It's fully ACID (Atomicity, Consistency,
Isolation, Durability) and even allows concurrent access (it serializes writes).

It's probably the most deployed "database" (thousands of embedded devices,
it's the fs on some ;-) ).
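
As a small illustration of the transactional part, the sketch below uses Python's built-in `sqlite3` module. The `accounts` table, the names, and the helper functions are invented for the example; the point is only that either both updates land in the file or neither does.

```python
import sqlite3


def init_db(db_path: str) -> None:
    """Create a tiny example table with two accounts."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:
            conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
            conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                             [("alice", 100), ("bob", 0)])
    finally:
        conn.close()


def transfer(db_path: str, src: str, dst: str, amount: int) -> None:
    """Move `amount` between accounts as one all-or-nothing transaction."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            (balance,) = conn.execute("SELECT balance FROM accounts WHERE name = ?",
                                      (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
    finally:
        conn.close()


def balances(db_path: str) -> dict[str, int]:
    """Read all balances back as a plain dict."""
    conn = sqlite3.connect(db_path)
    try:
        return dict(conn.execute("SELECT name, balance FROM accounts"))
    finally:
        conn.close()
```

If `transfer` raises partway through, the first update is rolled back and the file still holds the previous balances.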

