What are the best non-database solutions you've seen? What did Viaweb use? - mattjaynes
======
pg
Keep everything in memory in the usual sort of data structures (e.g. hash
tables). Save changes to disk, but never read from disk except at startup.
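A minimal sketch of that approach in Python (the log file name and record format are made up for illustration): keep a hash table in RAM, append every change to a log, and replay the log once at startup.

```python
import json
import os

LOG = "changes.log"  # append-only journal; the file name is illustrative
store = {}           # all live data lives in this in-memory hash table

def startup():
    """Replay the journal once at startup; never read from disk afterwards."""
    store.clear()
    if os.path.exists(LOG):
        with open(LOG) as f:
            for line in f:
                key, value = json.loads(line)
                store[key] = value

def put(key, value):
    """Update RAM first, then append the change to the on-disk log."""
    store[key] = value
    with open(LOG, "a") as f:
        f.write(json.dumps([key, value]) + "\n")
        f.flush()
        os.fsync(f.fileno())  # make sure the change is durable

startup()
```

Reads are then plain hash-table lookups served entirely from RAM; the only disk reads happen inside `startup()`.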

~~~
dk
A similar approach is so-called "object prevalence"
(<http://en.wikipedia.org/wiki/Object_Prevalence>). Basically, keep all the
data in RAM and write a journal of changes. On startup, the journal is played
back, incrementally restoring the state. Snapshots are taken periodically to
keep the journal down to a manageable size.

It's a form of the Command design pattern and has some nice properties. You
can get transparent thread synchronization by executing queries in parallel
but serializing commands that modify state. For web apps, the HTTP request
offers a natural representation for commands (but of course you'd want to
strip them down to their essence). You can get fault-tolerance and scalability
by feeding the command stream to replica servers. (State-changing commands
must be executed by the master server but queries can be load-balanced across
the replicas.) And if you keep the journals around, you have a complete
history of the application's state.

EDIT: See also <http://www.advogato.org/article/398.html>
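A rough sketch of those journal/snapshot mechanics in Python (the file names and the command format are invented for illustration): state-changing commands are serialized under a lock and journaled, queries read freely from RAM, and a snapshot lets the journal be truncated.

```python
import json
import os
import threading

SNAPSHOT = "state.snapshot"    # illustrative file names
JOURNAL = "state.journal"
state = {}
write_lock = threading.Lock()  # commands serialize; queries run in parallel

def apply_command(cmd):
    """A command here is just ("set", key, value); a real system would
    journal a stripped-down form of the HTTP request instead."""
    op, key, value = cmd
    if op == "set":
        state[key] = value

def execute(cmd):
    """Journal the command, then apply it, under the write lock."""
    with write_lock:
        with open(JOURNAL, "a") as f:
            f.write(json.dumps(cmd) + "\n")
        apply_command(cmd)

def query(key):
    """Read-only access: no disk I/O, no lock for a single lookup."""
    return state.get(key)

def snapshot():
    """Dump the whole state, then truncate the journal to keep it small."""
    with write_lock:
        with open(SNAPSHOT, "w") as f:
            json.dump(state, f)
        open(JOURNAL, "w").close()

def recover():
    """Startup: load the latest snapshot, then replay the journal on top."""
    state.clear()
    if os.path.exists(SNAPSHOT):
        with open(SNAPSHOT) as f:
            state.update(json.load(f))
    if os.path.exists(JOURNAL):
        with open(JOURNAL) as f:
            for line in f:
                apply_command(json.loads(line))
```

Feeding the same journal to a replica that calls `apply_command` on each entry is what gives the replication property mentioned above.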

~~~
dk
I probably should have pointed out the implications for disk I/O. In many
cases, the serialization of the command in the journal has a smaller footprint
on disk than the data that's modified. Consider an extreme case where a small
HTTP POST touches dozens or hundreds of records.

And appending data to a log is essentially an ideal disk access pattern.

Of course this can't be said for the snapshots, but you can offload that task
to a replica server.

------
njharman
All you "must be stupid not to use rdbms" people need to pry open your minds a
little.

There are all sorts of apps that use mem/fs for storage: mail servers, news
servers, Squid servers, static HTTP servers, log files, and SQL servers
themselves (discounting direct I/O).

An RDBMS is (usually) just another layer on top of the fs. A very useful layer
if you need what it offers, but if not, just a layer of complexity and wasted
mem/cpu.

Modern filesystems have journaling and smart buffering/caching. They kind of
rock, and dismissing them out of hand is a mistake.

------
jey
I fully agree with what pg said about keeping everything you can in in-memory
structures. You can save modifications to a sequential logfile, or some other
simple on-disk structure that is easily maintained (e.g. separate file per
record), and read this structure into memory when you restart. Once that
becomes infeasible, I would switch to using SQLite's BTrees [1] directly and
bypassing the entire SQL layer. Using the SQL layer is just pretty pointless
and a waste of resources. You're going to have a "data access layer" over your
database anyway, so why not have it operate on a level where you have precise
control over the access patterns and can tune it for your application? If you
use the SQL layer, you're basically _losing_ all that information by having
your data access layer transform your operations into a high level data
manipulation language by _building a string_ (eww), then you leave the SQL
layer of the database engine to parse your string back to something sensible
and have it _guess_ at how to optimize the query to get the best performance.
Instead just tell it directly and skip the whole string construction, string
parsing, and query analysis crap.

Notes

1\. I'd suggest BDB, but it seems to suck. YMMV, always benchmark it yourself.

~~~
evgen
I would second the suggestion to try SQLite first (or give metakit a glance if
you want something that lets you iterate over a table quickly) and concur with
staying as far away from BDB as is possible...

The other advantage that using memory/fs has over a RDBMS for most simple
applications is that unless you are doing complicated joins you will keep
things simple and avoid introducing an additional layer of complexity and
point of failure into your app.

~~~
anupamkapoor
Could you please explain why BDB sucks? Thanks!

~~~
ralph
Seconded, please explain. I don't have a view but was wondering about using
Berkeley DB, including its replication, to have the web server running on each
machine, with the BDB library linked in. Each machine would be a web server
but also one of the replication group, with one master and many replicas.
Writes go to the master, reads can be shared around the replicas.

BDB seems pretty flexible in policy decisions, e.g. the master can say the
commit is complete when it hears back from N replicas despite not having
written to disc itself yet.

------
jsjenkins168
If you need to store to disk at all, use a DB. The exceptions are when you have
very small amounts of data, or when it's not a big deal if it's lost (like user
preferences or similar). You could just serialize those to flat files.

DBMS are optimized for the fastest possible disk I/O. By their very nature
they are also designed with data integrity in mind. It would be a joke not to
use one if you are handling important data where persistence is even remotely
a factor.

As you grow you'll be glad you implemented the DB code, as it will scale much
better. You'll also be able to do analysis more easily, which can be very
important in learning about usage of your product.

But I'm definitely a fan of utilizing ram as much as is allowed in the
interest of speed. Try to strike a good balance between data integrity and
performance...

~~~
dk
"DBMS are optimized for the fastest possible disk I/O."

I think that's disputable. DBMSes are designed for a number of considerations
and it's not hard to demonstrate how alternatives can outperform a DBMS in
terms of disk I/O and general performance.

Consider an HTTP request that modifies records. A typical DBMS-backed app will
write all the changes to disk whereas a prevalent system (AKA object
prevalence) need only write "POST /someurl arg1=value1:arg2=value2:..." or
some equivalent. The data is updated in RAM with a write to only one or two
disk sectors in the majority of cases, no communication with a DBMS, no
construction and parsing of SQL, and none of the other overhead. A typical
prevalent system will be orders of magnitude faster than the DBMS-backed
equivalent, and simpler to boot.

Object prevalence doesn't offer a query language and has different scalability
considerations, but it would take an absurdly broken design for a DBMS to
outperform it.

------
kogir
Perhaps this makes me just a programmer, and not a 'hacker', but I for one can
write SQL queries and design a good DB schema in far less time than I could
create something similar in RAM myself.

~~~
mattculbreth
I don't think so. Basically the idea is just to use whatever data structures
you were already using (lists, arrays, hash tables, etc.) and just keep them
around. Since you already have this in your code, you avoid doing anything
with a DB. There are tools in a variety of languages and frameworks to help
with the persistence aspects.

I don't see how that could take more time than doing both the normal program
data structures and the DB.

~~~
kogir
On further thought I think you're right. The need for joins can be alleviated
with pointers, and sorting and grouping can be done in the application.

The tricky part would be the synchronization and transactional aspects, and if
the system became sufficiently distributed that part could get fairly nasty.

~~~
SwellJoe
You're still missing the point. Joins and sorting and grouping in SQL are what
you use to populate the real data structures. Whether you know it or not, in
all but the most simplistic CRUD application, you've gotta write code to deal
with all of those data structures. The database is always in addition to that
complexity. The argument being made is that you can leave out the database
entirely in some environments, and not add that complexity.

The entirety of this thread is questioning the pervasive use of databases in
web applications. There are some where it makes excellent sense (accounting,
CRM, ERP, the stuff that's been the stronghold of Oracle and SAS for years).
And others (wikis, blogs, forums, photo galleries, etc.) where it may not make
good sense, because you introduce hundreds of unnecessary operations and
significant additional database support code. That's not to say a database is
never the right solution for these problems (when your problem starts looking
relational, you should start looking at relational databases, because your
Ruby or Python or Lisp implementation is going to be worse than what
PostgreSQL or MySQL have). And, as someone else mentioned (but got modded
down)...when you start building your own flaky transactional layer, then it
may be time to consider a database that has good transactions support.

------
jkush
You can also reference this article:

<http://radar.oreilly.com/archives/2006/04/database_war_stories_2_bloglin.html>

~~~
mattjaynes
Awesome info, thanks ;)

~~~
jkush
Sure, glad you found it useful!

A few months ago, I created a silly little sudoku site, then blogged about how
I did it with no database backend (man, I got flamed). Truth is, I really
didn't need one but lots of people couldn't see past the fact that I wasn't
using the conventional database approach. In their minds, there was simply no
reason why I should have used a few flat files.

The whole idea of databases and flat files is such a polarizing topic. It's
terribly interesting.

------
npk
Ok, here's the obvious follow-up question:

What DB does news.YC use?

~~~
pg
Same as Viaweb. Hash tables in memory. Updates to disk, but no reads from disk
except at startup.

~~~
rmack005
Does the Arc/Lisp process only handle one request at a time? If it handles
multiple requests at once how do you keep the writes atomic?

------
cratuki
Allegro-cache looks interesting although I haven't tried it:
<http://franz.com/products/allegrocache/index.lhtml> I like the idea of having
a powerful datastore within the same process as the application because you
don't have the IPC overhead. Although, if you're using Prolog, you're probably
going to be less inclined to do complex things in memory than you sometimes
are with SQL.

Some considerations that come to mind:

1) Relational databases bring a form of 'automatic' documentation to a
project, in that somebody who hasn't touched it before can expect to make a
reasonable start on understanding it by using known tools to look at the
schema.

2) You get powerful hot-patching tools with a relational database (sqlplus,
psql, or similar) that have the safety of things like foreign key constraints.

3) Major version upgrades. As you're developing you can track db changes by
writing change scripts. Then when you do your upgrade you can 'pull the
lever'. There's nothing stopping you from doing this with any other structure;
it's just something to think about.

~~~
shiro
Personal experience from AllegroStore, a predecessor of AllegroCache:

\- In AllegroCache/AllegroStore, the class definition _is_ the schema
definition; so if the newcomer understands the class structures, he
understands the schema.

\- Hot-patching can be done through the plain old REPL. At least in
AllegroStore, the system took care of key consistency (if it was explicit to
the system).

\- In AllegroStore, schema change is handled as class change (of persistent
instances). So you can use the usual MOP to write an update function, which
corresponds to the change scripts.

The main difficulty, compared to an RDBMS, seemed to come from the fact that
the stored objects directly formed a graph, not a table. Some people seemed to
have a hard time "thinking" directly in graphs, and preferred the table
analogy.

------
mattjaynes
I've been using SQLite since it has the advantage of flat file storage and
also the power to run SQL queries against your data. It's very light and
powerful. But I was wondering about the specific setups other folks have used?

From: <http://paulgraham.com/vwfaq.html>

"What database did you use?

We didn't use one. We just stored everything in files. The Unix file system is
pretty good at not losing your data, especially if you put the files on a
Netapp.

It is a common mistake to think of Web-based apps as interfaces to databases.
Desktop apps aren't just interfaces to databases; why should Web-based apps be
any different? The hard part is not where you store the data, but what the
software does.

While we were doing Viaweb, we took a good deal of heat from pseudo-technical
people like VCs and industry analysts for not using a database-- and for using
cheap Intel boxes running FreeBSD as servers. But when we were getting bought
by Yahoo, we found that they also just stored everything in files-- and all
their servers were also cheap Intel boxes running FreeBSD.

(During the Bubble, Oracle used to run ads saying that Yahoo ran on Oracle
software. I found this hard to believe, so I asked around. It turned out the
Yahoo accounting department used Oracle.)"

~~~
brlewis
With Viaweb, the stores were islands, each dealing exclusively (or almost
exclusively) with its own dataset. They would not run into locking/contention
issues as they scaled the way some other apps would if they used flat files
and conventional locking.

The database landscape was different in 1995. There were no good free
relational databases, and no cheap Oracle licenses.

Lisp makes it very easy to write out any data structure. (Most Scheme
implementations can only write non-circular structures easily, but Viaweb used
CL). So if your app is amenable to using flat files, coding storage/retrieval
is trivial.

So basically, if you're writing an app with lots of separate datasets, flat
files are a viable option. If you're using Lisp, they're an easy option. If
it's 1995, they're a cheap option.

~~~
ntoshev
I'm curious how flat files can handle the evolution of the data structures you
use. Any advice?

~~~
vikram
Use a hashtable for everything. So a Person who has a name, address, age and
occupation is represented as a hashtable with those fields. When you want to
add a new field like emailaddress, it's just another key in the hashtable.

Ideally you want to code the person accessors so that they stay stable
externally:

person-name person-age

The rest of the program uses these, so it becomes easy to move person to a
different structure if that's what you want.

You want to write a function to read a hash table and write it out to disk. In
Lisp this is trivial, but most languages have some sort of serializer. Then a
function to load all the hash tables up when you start up.

------
Tichy
I hope this is not totally embarrassing, but with the file system I am always
afraid of losing data. What if something happens to the server at the very
moment you change the file? It seems to me the only solution is to keep two
files and change them alternately, like the double buffering for graphics, and
that would probably be slow.

How do you file-advocates solve that problem?

What would you use for firefox extensions? I suppose I can't connect to a
database with Javascript.
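For what it's worth, the usual file-advocate answer to the torn-write worry above isn't two alternating files but write-then-rename: write the new contents to a temporary file, fsync it, and atomically rename it over the old one. A Python sketch (the JSON payload is just for illustration):

```python
import json
import os
import tempfile

def atomic_write(path, data):
    """Write to a temp file, then rename it over the target. On POSIX
    filesystems, rename is atomic, so readers see either the old file
    or the new one, never a half-written mix, even if the server dies
    mid-write."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)  # same fs as the target
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())  # data hits the disk before the rename
        os.replace(tmp, path)     # atomic swap; no second "buffer" file
    except BaseException:
        os.unlink(tmp)
        raise
```

The temp file must live in the same directory (same filesystem) as the target, since rename is only atomic within a filesystem.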

~~~
aston
As of FF2, you can use the built-in SQLite database for your Firefox
extensions. And yes, it's accessible from JavaScript. The interface is a lot
like JDBC.

~~~
Tichy
Hey thanks, that is really good to know. I am just toying around with the idea
of writing an extension, and every now and then I collect new information. So
far it seems a bit scary to me, but there are so many cool things one could
make as extensions.

------
richcollins
I think it is a bit harder when the data is highly connected (social web app).

However, memory is so cheap now that you could probably get away with keeping
all of your business objects in memory, having them write-through to disk for
persistence. I imagine that you would have to move business objects around the
servers to decrease messaging latency as the connectivity between objects
changes.

I don't know of any frameworks that use this approach.

------
proj
You still have to implement queries, sorting, joining, etc. If all your
queries are simple key queries then you're basically done. You can also hash
at the network level by assigning group hash identities to specific machines in
your topology.

Where you will find a benefit to using an RDBMS is in heavy correlation and
multilevel sorting, especially when you're talking about very large data sets.
This is one of the key problems that RDBMSes address. If your requirements
also say anything about atomicity and integrity of the data, it may be a
better investment in the long run to go with an existing database solution
that has had a couple of decades to work out those details for you.
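The network-level hashing mentioned above might look something like this (the machine list is hypothetical). Note this is the simple modulo scheme, which remaps most keys whenever the topology changes; consistent hashing exists to avoid that.

```python
import hashlib

# Hypothetical topology: which machine owns which hash bucket.
MACHINES = ["node-a:11211", "node-b:11211", "node-c:11211"]

def machine_for(key):
    """Route a key to a machine via a stable hash of the key, so every
    server in the cluster agrees on where a given key lives."""
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return MACHINES[int(digest, 16) % len(MACHINES)]
```

Every front-end computes the same mapping independently, so no central directory is needed for simple key lookups.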

------
vchakrav
The best non-database solutions I have seen are:

1\. Google file system and crawler cache

2\. Lucene storage

------
PindaxDotCom
Umm, you realize MySQL is free, right? The real question is why you would even
consider a non-database solution. Is ALL your data static and non-relational?
If not, you are really limiting your application by not using a database.

~~~
edgeztv
As soon as you introduce an SQL layer to your app, your complexity goes to a
higher order of magnitude, making it that much harder to debug and
troubleshoot.

~~~
watfiv
That's hardly true in all cases. Depending on the situation, the "layer of
complexity" introduced by your having to build your own data structures is
often far worse than whatever is introduced by using an sql database. If
you're using a good ORM, it might not really be "complex" at all.

I don't mean this to argue that a db is always good, but I'd guess one of the
reasons that people use a database even when it's inappropriate is precisely
because it seems less complex.

~~~
wanorris
How can you build an application without data structures anyway? If you ever
load data from a database, it's loaded into a data structure of some kind.

Of course, if you're using a bunch of elaborate ORM stuff, I suppose it might
build your data structures without you ever really seeing them. Is that what
you meant?

~~~
watfiv
Basically: more precisely, that modern database APIs (not just ORMs) can hide
much of the complexity of using an sql database, so that it _seems_ neither
"elaborate" nor "a higher order of magnitude" more complex to the programmer
to build, test or debug. And since modern database layers often give you tools
to help build data structures, it might seem simpler to use those tools than
to build something without them.

------
weel
Metaweb (freebase.com) uses (or rather, is) a database, but not a relational
one. Those comments that implicitly identify "database" with "RDBMS" may
become outdated sooner than their authors expect.

------
pfedor
<http://labs.google.com/papers/bigtable.html>

------
cmars232
Domain-specific, but interesting RDBMS alternatives:

HDF5

Mnesia

Kx

------
rektide
mmap of course?

