

Databases for startups - mcgin
http://hackfwd.tumblr.com/post/2940338205/build-0-4-talks-databases

======
space-monkey
I see over and over again that "Relational databases are not easily scalable.
It will require a lot of efforts to build a scalable and redundant setup." I
would like to see more mention of how many users/rps/etc you can handle with a
maximally sized Amazon RDS redundant instance.

Scalable is different than infinitely scalable. Scalable used to mean, for
example, being able to get a close-to-linear speedup from multiple cores, more
or faster disks, etc.
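
One way to put a number on "close-to-linear" is a minimal sketch like the one below. The function names and timings are made up for illustration; nothing here was measured on a real database.

```python
def speedup(baseline_seconds, scaled_seconds):
    """How many times faster the scaled-up run is than the baseline run."""
    return baseline_seconds / scaled_seconds

def scaling_efficiency(baseline_seconds, scaled_seconds, resource_multiple):
    """Fraction of the ideal (linear) speedup actually achieved.

    1.0 means perfectly linear; values near 1.0 are "close to linear".
    resource_multiple is how many times more cores/disks the scaled run used.
    """
    return speedup(baseline_seconds, scaled_seconds) / resource_multiple

# Hypothetical numbers: a job takes 120 s on 1 core and 20 s on 8 cores.
print(speedup(120, 20))                # 6.0
print(scaling_efficiency(120, 20, 8))  # 0.75
```

By this definition a system can scale well (efficiency staying near 1.0 up to some hardware limit) without being infinitely scalable.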

~~~
saurik
I'd personally say that one of the key things that people have to realize when
going into the question "what can I accomplish with these tools" is that you
have to understand how the database technology works to get reasonable
performance out of it.

So, if you are using PostgreSQL, you really need to know a little about
journalling, B+-trees, multi-version concurrency (and the associated vacuum),
their "heap-only tuples" update strategy, and settings like
synchronous_commit.

Likewise, if you are using Cassandra, you really need to know a little about
LSM-trees, inverted indices, eventual consistency, the purpose behind column-
family storage, and how their read-to-write ratio affects all of consistency,
durability, and performance.

Outside of the specific database technology, if you want your data to actually
be there when the shit hits the fan, you absolutely have to understand write-
back vs. write-through cache behavior and how and where to apply cache
barriers (and what tools work correctly for them).
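
At the application level the same concern looks roughly like the sketch below, which assumes a POSIX system. `durable_write` is a hypothetical helper, not part of any database. Without the `fsync` calls the data may sit in a write-back cache (Python's buffer, then the OS page cache) when the function returns.

```python
import os

def durable_write(path, data):
    """Write bytes and ask the OS to push them to stable storage before returning."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # drain Python's userspace buffer into the OS page cache
        os.fsync(f.fileno())  # ask the kernel to flush that file's pages to the device
    # Also fsync the containing directory so the new directory entry survives a crash.
    # Opening a directory like this is a POSIX assumption; it does not work on Windows.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Even this only helps if every layer below (filesystem, controller, drive cache) honors the flush, which is exactly the part you have to verify for your own stack.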

Finally, your specific hardware is going to drastically affect how much
performance you get for your specific application: random and sequential
read/write I/O performance differs drastically between storage technologies,
and what matters is how that matches up with the read/write ratio and locality
of your application.

If you don't spend the time to learn these things, you are seriously just
going to get burned. We (as a civilization) simply do not have the science and
theory yet to make the practicalities of setting up and maintaining a database
server a totally seamless and simple process with well-understood performance
characteristics unless you constrain absolutely every single variable.

Unfortunately, learning a lot about how these algorithms work is really hard.
I mean, PostgreSQL HOT is almost undocumented at the user-level: you have to
drop to README.HOT from the source distribution to see what the performance
characteristics are.

Meanwhile, there is a ton of misinformation out there. :( I haven't watched
much of this video at all, but within 10 seconds of skipping around in it I
saw that this person claims that relational databases force re-writes during
schema updates, which is not true for the majority of updates that developers
actually try to make.

Finally, half of this stuff is really only understood well at the academic
level: to really get an understanding of why your particular load causes
PostgreSQL to slow down you might end up reading academic papers from the 80s
and 90s on the core algorithms we use to do data storage and indexing.

In the end, though, I'll say that most of my database needs are being served
right now by a single server for an application that is getting a million
users every day (with ten to fifteen million or so users total) with a ton of
room to grow.

That said, two core things that I value in my logging are currently being
stored to S3 directly (of all places, which is a ludicrous thing to use as a
database, really), and while I am pretty certain they will work great on the
new database server, I'm not entirely positive.

(For the record, my architecture is an m2.4xlarge running PostgreSQL with three
EBS disks, one with ext2 for the WAL, two in md RAID0 with xfs for the data,
where a couple of heavily updated tables are set with a 50% fillfactor, and all
non-durable writes are done without synchronous commit. I have a two-level
external pool, with pgbouncer running on both the application servers and on
the database server.)
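
The two PostgreSQL-side settings mentioned above could be expressed roughly as follows. This is a sketch that only builds the SQL text; the helper names and the table names are hypothetical, and it does not talk to a database.

```python
def fillfactor_sql(table, percent=50):
    """Build the statement that leaves free space in each heap page of a table.

    The spare room gives updated row versions a chance to stay on the same page.
    The naive double-quoting here assumes the table name contains no quote characters.
    """
    if not 10 <= percent <= 100:
        raise ValueError("table fillfactor must be between 10 and 100")
    return f'ALTER TABLE "{table}" SET (fillfactor = {percent});'

def non_durable_transaction(statements):
    """Wrap statements so that only this transaction skips waiting for the WAL flush."""
    return ["BEGIN;", "SET LOCAL synchronous_commit TO OFF;", *statements, "COMMIT;"]

# Hypothetical table names, for illustration only.
print(fillfactor_sql("sessions"))
for line in non_durable_transaction(["INSERT INTO page_views DEFAULT VALUES;"]):
    print(line)
```

`SET LOCAL` scopes the change to the current transaction, so durable writes elsewhere keep the default behavior. A changed fillfactor applies to pages written afterwards; existing pages keep their layout until the table is rewritten.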

(edit: I said RAID1 when I wanted RAID0. RAID0 increases your random I/O
performance on EBS, which is useful for most transactional database server
loads; given snapshotting and the inherent durability of EBS, RAID1 is not
necessary and will just saturate your network I/O)
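
The network point can be sketched as back-of-the-envelope arithmetic. The model below is a simplifying assumption (every write to a network-attached volume crosses the network once, with no caching or metadata overhead), and the function name is made up.

```python
def network_write_bytes(app_bytes, raid_level, volumes=2):
    """Bytes sent over the network to persist app_bytes under a simple model.

    RAID0 stripes: each block lands on exactly one volume.
    RAID1 mirrors: each block is written to every volume.
    """
    if raid_level == 0:
        return app_bytes
    if raid_level == 1:
        return app_bytes * volumes
    raise ValueError("this sketch only models RAID0 and RAID1")

# Hypothetical workload: 1 GiB of database writes across two volumes.
one_gib = 1024 ** 3
print(network_write_bytes(one_gib, 0))  # 1073741824
print(network_write_bytes(one_gib, 1))  # 2147483648
```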

~~~
lwat
Data is hard. Let's go shopping!

------
pmikesell
We've built a clustered, infinitely scalable, fault-tolerant
database system that supports full relational and ACID semantics. It also
emulates the MySQL protocol, so it's very drop-in. <http://www.clustrix.com>

(sorry to sound like an ad, but we really do have a great solution for this)

