
Zero load time file formats - LiveTheDream
http://stevehanov.ca/blog/index.php?id=123
======
leif
Too bad this guy never took a data structures course. Binary search on an
array is O(log N) in the DAM model, while search in a B-tree is optimal:
O(log N / log B), i.e. O(log_B N).

Note that if your elements are English words, let's be extremely generous and
say they average 12 bytes; with a 4-byte offset, that's 16 bytes per element.
With a disk block size of 512K (and it's usually more), that means B is 32K,
and my B-tree is 15 times faster than your array.

Of course, if your data set is only 11MB, who cares? You can read the whole
thing in less time than it takes to seek to the file.

~~~
ynniv
Indeed, this post falls into the category of "things that people often
rediscover". Using storage-optimized data structures and algorithms is a very,
very old idea.

[ <http://en.wikipedia.org/wiki/Merge_sort#Use_with_tape_drives> ]

~~~
leif
Funny you should bring up merge sort: in my first Google phone screen, I was
asked "why is merge sort good?" I answered "it's asymptotically optimal and
stable" and couldn't come up with anything else; the interviewer eventually
conceded that I was too young to know it's also better than other sorting
algorithms on tape :)

~~~
andrewf
CS professors always bring up the tape example, but it's true for any medium
where linear access is faster than random access, which becomes more relevant
as you move down from L1 cache -> memory -> local disk -> network.

Linked lists, too :)
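A minimal sketch of that effect at the RAM level of the hierarchy -- same
array, same work, only the access order differs (array size and timing method
are arbitrary choices here):

    /* Sequential vs. random traversal of the same array: identical work,
       different access order. The gap only widens for disk and network. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1u << 24)  /* 16M ints, ~64 MB */

    static unsigned long long rng = 88172645463325252ULL;
    static unsigned long long xorshift64(void) {
        rng ^= rng << 13; rng ^= rng >> 7; rng ^= rng << 17;
        return rng;
    }

    int main(void) {
        int *a = malloc(N * sizeof *a);
        size_t *order = malloc(N * sizeof *order);
        if (!a || !order) return 1;
        for (size_t i = 0; i < N; i++) { a[i] = (int)i; order[i] = i; }

        /* Fisher-Yates shuffle to build a random visiting order */
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)(xorshift64() % (i + 1));
            size_t t = order[i]; order[i] = order[j]; order[j] = t;
        }

        long long sum = 0;
        clock_t t0 = clock();
        for (size_t i = 0; i < N; i++) sum += a[i];         /* linear   */
        clock_t t1 = clock();
        for (size_t i = 0; i < N; i++) sum += a[order[i]];  /* random   */
        clock_t t2 = clock();

        printf("linear: %.3fs  random: %.3fs  (sum=%lld)\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC, sum);
        free(a); free(order);
        return 0;
    }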

~~~
leif
Of course, only years later would I understand the proof of why: an M/B-way
merge sort is optimal in external memory.
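For reference, the bound in question, with M the memory size and B the block
size (both in elements); an M/B-way merge sort matches the external-memory
sorting lower bound:

    sort(N) = Theta( (N/B) * log_{M/B}(N/B) )  I/Os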

------
jwilliams
An aspect to keep in mind if you do this is (file) portability. Since the
on-disk layout is tied to the in-memory data structure, you could
inadvertently tie the data format to a specific architecture -- e.g.
endianness, integer sizes, 32- vs. 64-bit, etc.

~~~
mcpherrinm
That's certainly a problem that exists, but it isn't one that's terribly hard
to solve.

Network programmers have had this problem from the beginning, since distinct
machines have to communicate. The same techniques can be used when writing to
a file instead of a network stream: explicit integer sizes and a fixed byte
order fix this.
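A minimal sketch of that technique in C (the header fields here are
hypothetical): fixed-width types, plus assembling each field byte by byte, so
the same code works on either host endianness.

    /* Read a little-endian u32 from a byte buffer, regardless of host byte order. */
    #include <stdint.h>

    static uint32_t read_le32(const unsigned char *p) {
        return (uint32_t)p[0]
             | ((uint32_t)p[1] << 8)
             | ((uint32_t)p[2] << 16)
             | ((uint32_t)p[3] << 24);
    }

    /* Hypothetical file header: explicit widths, explicit byte order on disk. */
    struct header {
        uint32_t magic;
        uint32_t record_count;
        uint32_t index_offset;
    };

    static void parse_header(const unsigned char *buf, struct header *h) {
        h->magic        = read_le32(buf + 0);
        h->record_count = read_le32(buf + 4);
        h->index_offset = read_le32(buf + 8);
    }

(Of course, as soon as you do this you're parsing again rather than using the
bytes in place, which is the tension the reply below points out.)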

~~~
saurik
... but then you don't have a "zero load time file format", at least on
systems whose endianness differs from the one the file was generated with.
(At least, unless you are willing to have the entire program not assume the
endianness of its own internal data structures, which would be kind of
insane.)
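For context, the "zero load time" idea being debated is roughly this: mmap the
file and use the bytes in place, with no parsing step. A minimal sketch, with
a made-up file name and struct layout:

    /* "Zero load time": mmap the file and use the structure directly.
       This only works if the on-disk layout matches the host's endianness
       and struct layout. File name and fields here are hypothetical. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    struct dict_header {
        uint32_t magic;
        uint32_t word_count;
        uint32_t offsets[];  /* sorted offsets into the string pool that follows */
    };

    int main(void) {
        int fd = open("words.dat", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        struct stat st;
        fstat(fd, &st);
        const struct dict_header *h =
            mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (h == MAP_FAILED) { perror("mmap"); return 1; }
        /* No parsing, no copying: the mapped bytes are the data structure. */
        printf("%u words\n", h->word_count);
        munmap((void *)h, st.st_size);
        close(fd);
        return 0;
    }

Which is exactly why a byte-swapping pass defeats the purpose: the moment you
have to touch every field, you're back to a load step.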

~~~
palish
In the gamedev world, little-endian is typically "the fast one", and
big-endian is where conversion is required.

(This is merely because Windows -- and therefore little-endian x86 -- is the
dominant gamedev platform.)

------
dexen
<http://cmph.sourceforge.net/>

Pretty much a read-only format (you can't modify the file, only create a new
one from scratch), but it's darn fast: O(1) lookups. Also very compact.

Used the PHP wrapper around it once:
<http://www.php.net/manual/en/book.chdb.php>
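A rough sketch of what using it looks like from C, going from memory of the
example in the cmph docs (the hash function itself can also be serialized to
disk, which is the file-format part); exact function names and signatures
should be checked against the library's headers:

    #include <cmph.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        const char *keys[] = { "apple", "banana", "cherry" };
        unsigned int nkeys = 3;

        /* Build the minimal perfect hash function once, from the full key set. */
        cmph_io_adapter_t *source = cmph_io_vector_adapter((char **)keys, nkeys);
        cmph_config_t *config = cmph_config_new(source);
        cmph_config_set_algo(config, CMPH_BDZ);
        cmph_t *hash = cmph_new(config);
        cmph_config_destroy(config);

        /* O(1) lookups: each key maps to a distinct id in [0, nkeys). */
        for (unsigned int i = 0; i < nkeys; i++) {
            unsigned int id = cmph_search(hash, keys[i],
                                          (cmph_uint32)strlen(keys[i]));
            printf("%s -> %u\n", keys[i], id);
        }

        cmph_destroy(hash);
        cmph_io_vector_adapter_destroy(source);
        return 0;
    }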

~~~
leif
See also cuckoo hashing.

------
palish
This isn't necessarily true.

A file is divided into 4096-byte "pages" on disk, I think. Each page can be
stored in a different physical location, completely disjoint from the others.
So you would have to wait while the hard drive seeks to the correct location.
(Or get an SSD.)

However, that's probably a pedantic observation.

~~~
Scaevolus
They're called sectors, not pages.

You're describing worst-case fragmentation, which filesystems work very hard
to prevent.

In general, the fastest thing to read off a disk is a single large file,
assuming minimal fragmentation. Even SSDs have much better sequential read
performance compared to random-access.

------
kelnos
Since when was disk access instantaneous?

