

The missing opensource software for web development: a scalable database - antirez
http://antirez.com/post/missing-scalable-opensource-database.html

======
iamelgringo
I really think that this is the big plus for Google's app engine. Most web
sites are essentially front end GUIs for a database with a sprinkle of data
crunching thrown on top. It's all about the database. And, one of the biggest
pain points for a growing startup is scaling the database.

Enter App Engine. Google has spent years developing Big Table and their
underlying distributed OS, and they are probably years ahead of anybody else
on this one. Essentially what App Engine consists of is giving 10,000
developers read/write accounts to the world's largest distributed database
management system.

Now, it's up to the developers to add their HTML/CSS/Javascript GUIs and a
little data crunching middleware for all the world to use.

~~~
mechanical_fish
I cautiously agree, but I have two big concerns.

One is that Google's DB is both unique and closed-source. Worse, it isn't even
a product like Oracle or SQL Server that I can license and then use the way I
want. Even if I wanted to buy the hardware and maintain the software, I can't
move my app off of Google. I'm a sharecropper. They get my traffic data for
free, and if they change their pricing structure or the way their DB works I
have to scramble to keep up. And if they decide to screw with me (like, _ahem_,
becoming my competitor) I will have to port my app to some other scalable
platform _while_ serving all my accumulated traffic _and simultaneously_
fending off a deep-pocketed competitor.

I'm also suspicious of the idea that BigTable is some sort of magic wand that
solves any and all scalability problems. I'm very sure that BigTable scales
for the classes of problems that it was designed for. And I'm sure that a
wizard like Steve Yegge or Peter Norvig can adapt it to many other classes of
problem. But, without actually knowing anything about BigTable, I'm prepared
to bet that using it to scale your web app will require (a) knowing a fair bit
about how the tool works and (b) customizing your app's data storage scheme to
complement the tool, after which (c) there will still be some corner cases
that don't work very well and require clever hacks and compromises.

In other words: I predict that in three years there will be Craigslist job
postings for "BigTable DBA with five years of experience".

------
neilc
The reason HTTP is trivial to scale and databases aren't is that the HTTP
daemons are largely stateless, while databases are all about managing state.
Doing that in a scalable, reliable, consistent way is just a fundamentally
hard problem. Oracle and DB2 haven't done particularly well at trying to solve
this (within the constraints of a traditional RDBMS), let alone the various
open source projects.

I spent a while trying to build a synchronous multimaster replication system
for Postgres. I think we made two main mistakes:

1. Trying to provide the traditional ACID semantics that people expect on a
single-site DBMS isn't feasible, at least without incurring a very significant
performance and complexity overhead.

2. Horizontal partitioning is _key_. If you make it easy for the user to
partition their data, you now only need to maintain consistency over a single
partition.
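
A minimal sketch of the horizontal partitioning idea, assuming a simple hash-on-key scheme. The `PartitionedStore` class, its method names, and the in-memory dicts standing in for separate database nodes are all hypothetical, not anything from the Postgres work described here. The point it illustrates is that every read and write for a given partition key touches exactly one partition, so consistency only has to be maintained there.

```python
import hashlib


class PartitionedStore:
    """Toy horizontally partitioned key-value store (illustrative only)."""

    def __init__(self, num_partitions):
        # Each dict stands in for an independent database node (assumption).
        self.partitions = [dict() for _ in range(num_partitions)]

    def partition_for(self, partition_key):
        # Use a stable hash so the same key always maps to the same partition.
        digest = hashlib.sha1(partition_key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.partitions)

    def put(self, partition_key, field, value):
        # All fields for one partition key live together on one partition.
        idx = self.partition_for(partition_key)
        self.partitions[idx].setdefault(partition_key, {})[field] = value

    def get(self, partition_key, field):
        idx = self.partition_for(partition_key)
        return self.partitions[idx].get(partition_key, {}).get(field)


store = PartitionedStore(num_partitions=4)
store.put("user:42", "email", "a@example.com")
store.put("user:42", "name", "Ann")
print(store.get("user:42", "email"))
```

A plain modulo scheme like this reshuffles most keys when the partition count changes; it is used here only because it is the shortest way to show single-partition routing.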

------
wvenable
Only 0.001% of startups (and I'm being generous) need the sort of power being
discussed here. The reason "a scalable database" doesn't exist is because
there isn't a big enough market for one.

I have no doubt we'll hear more about this in coming months, not because of
any specific need, but because it's the latest fad.

~~~
SwellJoe
I was just talking about this with a YC applicant yesterday. Scaling up HUGE
is a really interesting topic, and a lot of nerds love to talk about it...but
the number of sites that actually require extreme scaling is very low. When we
first started selling Virtualmin we had a lot of early adopters asking about
database and web app replication, load balancing, etc. Not because the folks
asking actually had sites that needed that kind of performance, but because
they're cool technologies to play with. A thousand paying customers later and
the demand for those features has dropped to background noise (the same early
adopter folks who mostly just want to play with it rather than have the
traffic to justify it). Far more of our users are asking for the ability to
run more (more sites, users, mailboxes, applications, etc. and not generally
more reqs/second) on a single server rather than the ability to spread load
across many servers.

The performance of hardware has managed to keep pace with the needs of the
vast majority of websites over the years, to the point where very few sites
(like the top 500 or so) actually ever need more than just the basic scaling
ideas that are easy for just about any sysadmin to implement (split mail, web,
DNS, and database onto independent boxes, use memcached, maybe a web load
balancer like Squid or pen, etc.).
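
As a rough illustration of the "use memcached" step, here is a cache-aside sketch. It assumes a plain dict in place of a real memcached client, and the `CacheAside` class and `load_from_db` callback are hypothetical names made up for the example. Reads check the cache first and only fall through to the database on a miss.

```python
class CacheAside:
    """Cache-aside read path with a dict standing in for memcached."""

    def __init__(self, load_from_db):
        self.cache = {}                   # stand-in for a memcached client (assumption)
        self.load_from_db = load_from_db  # hypothetical callback that queries the database
        self.db_reads = 0                 # counts how often the database was actually hit

    def get(self, key):
        # Serve from the cache when possible.
        if key in self.cache:
            return self.cache[key]
        # Cache miss: read from the database and remember the result.
        self.db_reads += 1
        value = self.load_from_db(key)
        self.cache[key] = value
        return value

    def invalidate(self, key):
        # Call this after writing to the database so stale data is not served.
        self.cache.pop(key, None)


fake_db = {"page:1": "hello"}
pages = CacheAside(fake_db.get)
first = pages.get("page:1")
second = pages.get("page:1")
print(first, second, pages.db_reads)
```

Repeated reads of the same key reach the database once until the key is invalidated, which is where most of the load reduction comes from on read-heavy sites.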

~~~
davidw
Could it be that the people who select Virtualmin aren't the same people who
need to scale up? Not a rhetorical question, I don't know you guys well enough
to have an idea.

> The performance of hardware has managed to keep pace with the needs of the
> vast majority of websites

That makes sense, but the absolute number of web sites that need to scale up
in a big way is also growing, even if it could actually be shrinking as a
percentage.

~~~
SwellJoe
It could very well be a bit of a situation where, "We don't offer a scaling
solution, so customers that need to scale don't come our way, so we don't hear
from customers that they need to scale."

But, it's worth noting that my previous (now defunct) company was entirely
devoted to web performance and scalability. It's not a volume business--the
folks who need it spend a lot on the problem, but there just aren't that many
who need the extreme solutions. I've often chuckled when the few Virtualmin
customers who do want scalability have explained their requirements and they
match to a great degree the products I was building five years ago (and found
to be a niche that wasn't worth continuing to expend effort on). If I thought
it would be profitable to pursue, I could certainly revive some of those
products in the context of Virtualmin...but I think there are far more
profitable areas for us to work on.

Everything related to the web is growing, so all ships are rising, including
scalability issues...but I believe several others are rising much faster.

------
antirez
Not needed? Even the book Founders at Work is full of stories where the
problem was scaling (delicious, paypal, blogger, ...). Every startup that's
going to work well is going to have this kind of scalability problem (and most
of the time it is on the DB side, not the HTTP front end).

------
SwellJoe
While I haven't used either (or BigTable for that matter), I was under the
impression that both HBase (the Hadoop database) and HyperTable
(<http://www.hypertable.org/>) were Open Source competitors to BigTable. But
there does seem to be some work happening there.

It's pretty rarefied air up in the high scalability world...my previous
company built website acceleration tools, and the customer base there is
pretty small (they have plenty of money, but it's not a high volume business).
As much as us nerds like to think about HUGE problems like this, it's a
problem that just doesn't come up that much.

------
wmf
Maybe Sun will dust off Clustra from whatever closet they left it in. Or maybe
H-Store will come out soon and support a large enough subset of SQL to satisfy
people.

~~~
bayareaguy
I don't have any direct data about Clustra but I read the Hypra/Clustra
book[1] a while ago. It uses a tightly coupled scheme of parallel hardware,
software and networking to provide high availability. Scaling that sort of
system will cost you a lot more than running H-Store on a cluster of cheap
boxes.

[1] <http://books.google.com/books?id=KH_kfSsMu80C&printsec=frontcover&dq=HyPra+database&source=gbs_summary_s&cad=0>

------
lg
I'm interested in this myself, if only for coolness as SwellJoe mentioned. The
comments on there mentioned CouchDB. Does anyone know if it's worth using yet?

------
JaredRad
Check these out: hypertable.org and cleversafe.org

