

What is the best approach to web app development in Python without a RDBMS? - dood

Having read a couple of threads [1] on dropping the RDBMS in favor of keeping
data in RAM and logging to disk, I'm wondering what is a good setup for a
typical Python web app.

There seem to be a number of options: memcached, BDB, ZopeDB, metakit,
serialising into sqlite...

Any ideas?

1. <http://news.ycombinator.com/item?id=14605>,
<http://news.ycombinator.com/item?id=16098>
======
Hexayurt
<http://itamarst.org/software/cog/>

Cog is the Checkpointed Object Graph object database, providing semi-
transparent persistence for large sets of interrelated Python objects. It
handles automatic loading of objects on reference, and saving of modified
objects back to disk. Reference counting is used to automatically remove
no-longer-referenced objects from storage, and objects will automatically be
attached to the database if a persistent object references them.

======

I looked this package over a few years ago, and I think it got an awful lot of
things right... but not enough.

It's worth examining the design if you want to understand the intricacies of
non-RDBMS approaches. A lot of thought went into it.

------
jey
Depends, what are your requirements or what does your program do?

RDBMSes are popular because they're easy and they perform decently for most
answers to the previous question. And it may be that an RDBMS _is_ the best
answer for your application.

~~~
dood
I should have been clearer in the question: I'm not looking for a solution to
a specific problem, I'm looking for a general approach to building this kind
of system, or an understanding of the benefits and trade-offs of different
methods. I understand why RDBMSes are normally used, and am comfortable with
SQL, but I'm interested in the alternatives.

~~~
jey
Start simple: store the data in-memory in whatever way makes sense, and just
write a log file that contains pickled transactions that can be played back.
Conceptual sketch of how I would do it follows. This isn't "the" way to do it,
just "a" way to do it.

    
      # this is the utility method used by the rest of your code
      # to register a new user. maybe this is at module scope in
      # the User.py module
      def register_user(user):
          run_transaction(AddUserTransaction(user))
    
      # this is the actual object that represents the
      # transaction
      class AddUserTransaction(Transaction):
          def __init__(self, user):
              Transaction.__init__(self)
              self.user = user
    
          # all Transaction objects have an apply() method
          def apply(self):
              Transaction.apply(self)  # invoke base method
    
              # the following MyGlobalUserTable probably just
              # stores a couple dicts as indexes over the user info
              # e.g. it will have a dict by username, by user ID, etc
              MyGlobalUserTable.add_user(self.user)
    
      def run_transaction(t):
          # apply the transaction first, then write to the log,
          # since if it crashes while running the transaction,
          # you don't want to crash again when you play back the log.
          t.apply()
          transaction_log.append(t)
    

Here transaction_log.append(t) is some method that will pickle the transaction
and append it to some log file. You'll have multiple classes like
AddUserTransaction, all derived from Transaction. When you crash and want to
play back the transaction log, all you have to do is unpickle the Transaction-
derived objects and call apply() on them in the same order.
<http://docs.python.org/lib/module-pickle.html>

Caveats:

-- Once you have a huge number of transactions, playing them all back at
startup will cost too much. You can fix that when you get there, and
replaying all the transactions also gives you a way to import the data into a
new format.

-- The above approach is horrid if you want to launch a new process for every
single request. You'd have to replay the transaction log for each incoming
request, you'd have a data coherency nightmare, etc. So if you want to use
this approach, make sure you're using something that shares one process
amongst all requests.
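One way to address the first caveat, sketched here as my own assumption
rather than something from the thread, is to checkpoint: pickle a snapshot of
the in-memory state and truncate the log, so startup only replays
transactions made since the last snapshot.

```python
# Hypothetical snapshot/truncate sketch. Write the snapshot to a
# temp file, fsync, then rename over the old one (atomic on POSIX)
# before truncating the log, so a crash mid-snapshot loses nothing.
import os
import pickle

def snapshot(state, snap_path, log_path):
    tmp = snap_path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.rename(tmp, snap_path)     # atomically replace old snapshot
    open(log_path, "wb").close()  # log entries now live in the snapshot

def load_snapshot(snap_path, default):
    # starting state for replay: last snapshot, or an empty default
    if not os.path.exists(snap_path):
        return default
    with open(snap_path, "rb") as f:
        return pickle.load(f)
```

Startup then becomes: load the snapshot, replay the (short) log on top of it.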

You might also need to worry about whether your server is multithreaded and in
that case deal with locking and other crap. I'd suggest going simple with
something like Twisted's single-request-at-a-time approach. This isn't as bad
as it sounds at first; you just write everything to finish ASAP, and if
there's something you need to come back to, you just request the Twisted
Reactor to schedule an event. If you have some big blocking thing to do, do it
in another thread then notify the Reactor when it's complete (and have the
Reactor schedule an event for your code to be notified that the big blocking
thing finished).

As you can see, it's not totally trivial and general purpose like the RDBMS +
ThingThatGeneratesNastySQLFromObjects approach (aka ORM). But you get more
flexibility in your interfaces, and lower overhead. While it can be rewarding
and simplifying to work this way, it can also lead to over-engineering and/or
lost time and effort if you aren't already intimately familiar with the steps
in this approach.

~~~
mechanical_fish
The great thing about this post is that it doesn't hide the fact that
"building a site without an RDBMS" is perilously similar to "writing your own
buggy, half-implemented, slow, nonstandard DBMS".

Building a custom DBMS has been done, and done well, but I think a good
general approach to the problem is to read the line about "over-engineering
and/or lost time and effort" out loud, to your entire team, at dawn and noon
and sunset on every day of the project.

~~~
jey
> _is perilously similar to "writing your own buggy, half-implemented, slow,
> nonstandard DBMS"._

People have used files and in-memory data structures just fine for a long
time. I don't think they had bugs in their code owing solely to the fact that
RDBMSes hadn't been invented yet.

I also don't see how this is slow; it's all in-memory. Why bring a big honking
DBMS into the picture when all you wanted was a hash table?

Storing data in-memory doesn't amount to a "buggy, half-implemented, slow,
nonstandard DBMS" -- it's serving an _entirely_ different set of goals.
Storing data in-memory in data structures is how programming is done. If you
actually started to make the interface to your data layer as horrid as the
interface to most DBMSes ($DEITY help you), you'd definitely end up with a
shitty buggy half-implemented DBMS. But if you just want to store a couple
hash tables with a sane interface dictated by the software design, not by your
DBMS, just store the hash tables! Don't go wrestle with SQL just because
that's the buzzword in vogue.

You should measure your DBMS sometime, and consider it in terms of a hash
table lookup. It starts looking like a comedy of horrors: first, generate a
string in an obtuse language to perform the lookup, and oh wait, don't put a
quotation mark in the wrong spot! Got that string generated? Now send it over
a _socket_ to a server. Now the server is going to parse your string into an
AST, turn the AST into an internal representation, then it's going to _guess_
how to optimize your query (this is also expensive in terms of cycles). When
it's done optimizing, it hands the query to the evaluator, which looks in
memory (since your table is so tiny anyway), pulls out your hash table value,
encodes it, and ships it back over the socket to you.
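The round trip is easy to measure for yourself. Here is a toy comparison of
my own (not from the thread) using Python's bundled sqlite3, which runs
in-process and so even skips the socket hop a networked server would add:

```python
# Toy illustration of the parent's point: the same lookup done as a
# dict access versus a SQL query. sqlite3 is in-process, so a real
# networked RDBMS adds a socket round trip on top of this.
import sqlite3
import timeit

users = {"alice": 1, "bob": 2}

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (name TEXT PRIMARY KEY, id INTEGER)")
db.executemany("INSERT INTO users VALUES (?, ?)", users.items())

def via_dict():
    return users["alice"]

def via_sql():
    # parse, plan, and evaluate a query string to fetch one value
    return db.execute(
        "SELECT id FROM users WHERE name = ?", ("alice",)
    ).fetchone()[0]

assert via_dict() == via_sql() == 1
print("dict lookup:", timeit.timeit(via_dict, number=10000))
print("sql lookup: ", timeit.timeit(via_sql, number=10000))
```

Both return the same value; the timing difference is the overhead being
described above.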

I agree and concede that these days you'll fit into the ecosystem better if
you do use an RDBMS with no rationale (other than that it's the de facto
standard), generate giant SQL strings from objects, and most worryingly from
my perspective: deal with the big _impedance mismatch_ between the two
paradigms of OO/anything and RDBMS
(<http://en.wikipedia.org/wiki/Object-relational_impedance_mismatch>). I
haven't seen a single ORM implementation that really makes it truly simple.
You end up manipulating your structure at a lower level than you'd naturally
like to.

This is partly why my next project isn't being done with an RDBMS, and I'm not
storing it just in memory because of the volume of data involved. I'm going to
be using Erlang and its bundled Mnesia database. There's no impedance mismatch
there. The whole thing, including the database interface, works the way Erlang
works.

I'll throw a party when RDBMSes die.

~~~
mechanical_fish
> _I also don't see how this is slow; it's all in-memory._

Are you _not_ writing to disk on every single transaction, then? My bad. And
good luck keeping that power cord plugged in.

> _Why bring a big honking DBMS into the picture when all you wanted was a
> hash table?_

I didn't bring that DBMS into the picture. You did, in the second half of your
first sentence:

"Start simple: store the data in-memory in whatever way makes sense, _and
just write a log file that contains pickled transactions that can be played
back._"

Indeed, an in-memory cache _is_ really fast. But it's unfair to compare it
with an RDBMS, which is handicapped by the need to write every transaction to
disk before it can be committed. The hash table is great only up to the point
where a stray cosmic ray crashes the server and makes the whole thing
disappear, after which you realize that you need a log file.

In the general case, writing a transaction log file is a hard problem. If you
write a really robust tool for managing that log file - a tool which is
efficient at reading and writing even when the number of transactions grows
large; one that lets you specify when the log gets written, and how often, and
whether the system can be queried during the write, and how long those queries
will block, and what they will return; one which allows multiple threads and
multiple machines to read and write the data without concurrency problems; one
which prevents the in-memory cache from getting out of sync with the
filesystem - you will have implemented a substantial portion of MySQL, and
probably _memcached_ as well.

It is easy to start out designing a fast, simple, non-transactional DB and end
up reinventing MySQL. If you don't believe me, ask the folks who invented
MySQL!

In certain special cases (e.g. Google), rolling your own persistent storage
system is a big, big win. You may _know_ , in advance, that your website is
one of those cases. If you are correct, you will be a superstar - you will
build a relatively untested, nonstandard data storage system with a tiny
subset of PostgreSQL's features, but all of that will be worth it because the
system will be fast. If you are wrong, you will work for a month or six and
then end up installing PostgreSQL anyway. In fact, even if you are right, you
will end up installing PostgreSQL after your customer changes the spec at the
last minute to require some boring standard feature - like a shopping cart -
which any CRUD jockey can build in a day but which your "simple and elegant"
database doesn't support because it wasn't in the original spec.

So it's no surprise that the "de facto standard" is to build your site around
an RDBMS, get the damn thing working, and optimize later - and that, as a
result, the typical in-memory data structure ends up being backed by an RDBMS
instead of by a "simple" log file.

As for your ire at SQL... if you think you're in pain, just imagine how the
designers of SQL must have felt back in the 1970s, when string parsing and
query planning were _tens of thousands_ of times slower than they are now, an
extra database server cost _more_ than a coder's daily salary, and RDBMS
software was very, _very_ non-free. It was a dark time. And yet for some
reason those guys abandoned their efficient hand-rolled binary databases for
SQL. In fact, they did it so fast that Larry Ellison grew richer than the
Beatles. Why did they do that, I wonder? It must have been the drugs.

------
codeslinger
It sounds to me like the closest thing to what you want is a library for
Object Prevalence <http://en.wikipedia.org/wiki/Object_prevalence>. There is
one that I know of for Python called Pypersyst <http://pypersyst.org/>. There
is also an IBM DeveloperWorks article about this library by one of the
authors: <http://www.ibm.com/developerworks/library/l-pypers.html>. Hope
that helps :)

------
palish
You might want to ask yourself, "Why am I choosing to not use an RDBMS?"

If the answer is "because I know it will save time" or "because I'd like to
explore the possibility", more power to you. I wholeheartedly encourage it.

If the answer is anything else then you should just use one, because that path
will save you time (possibly a lot of time).

Check out Rails. The ORM is so well done that it feels like you aren't using a
database at all.

~~~
dood
Yes, I'm pretty much checking out the solution-space as a whole. I'll probably
end up slinking back to SQL, but at least I'll know why.

------
mattculbreth
What kind of app are you building?

I don't think you'll find a pre-packaged framework out there that works
without a DB. I guess if I were doing this I'd use memcached and then I'd
write to a shared file system somewhere for the writes.

We use Pylons and SQLAlchemy/Elixir/PostgreSQL though and we're very pleased.

~~~
dood
I wasn't necessarily after a pre-packaged framework. I'm also very pleased
with Pylons so far; getting a non-RDBMS thing to play well with Pylons could
work out very neatly.

------
mxh
Twisted is an interesting framework, though a little lower-level than others.
I think you could build a self-contained, in-memory solution around it, as DBs
aren't particularly central to its operation.

------
kashif
Schevo (schevo.org) integrates well with Pylons.

~~~
dood
Schevo does look interesting; I'll have to give it a closer look. Though as
far as I can tell it isn't widely used... what has your experience of it been?

~~~
kashif
The documentation is limited, but it is very powerful and easy to use. The
support on IRC is excellent, although the response time is around a day.

