
A New Approach to Databases — Simulating 10Ks of Spaceships on my laptop - pron
http://blog.paralleluniverse.co/post/44146699200/spaceships
======
riobard
Correct me if I'm wrong, but doubling the performance of the simulation from a
dual-core to a quad-core processor doesn't necessarily mean you have "linear
scalability" unless you can demonstrate the trend continues as more cores are
used (e.g. 8-, 16-, and 32-cores).

I think the ideal way to demonstrate this kind of claim while eliminating the
effect of other factors such as memory performance is to use a beefy server
with many cores (let's say N >= 16), and adjust your benchmark to use 1, 2, 3,
..., N cores, then plot the line to see if it's really linear.
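
For example, the fit could be checked like this (the core counts and
throughput numbers below are made-up placeholders, not measurements from the
article):

```python
# Sketch: given (cores, throughput) measurements, fit a line and report R^2.
# A value of R^2 near 1.0 supports the "linear scalability" claim.

def linear_fit(xs, ys):
    """Ordinary least-squares fit y = a*x + b; returns (a, b, r_squared)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return a, b, 1 - ss_res / ss_tot

# Hypothetical numbers, purely to show the method:
cores = [1, 2, 4, 8, 16]
ops_per_sec = [10_000, 19_800, 40_100, 79_500, 161_000]
slope, intercept, r2 = linear_fit(cores, ops_per_sec)
print(f"slope={slope:.0f} ops/sec per core, R^2={r2:.4f}")
```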

Two dots make a line in geometry, but you need more dots to project the trend.

Time to spawn an EC2 Cluster Compute Eight Extra Large instance!

~~~
pron
You are absolutely right, but we'll do that as part of the next step, when we
show scaling beyond a single machine. Big servers, and more than one.

~~~
riobard
Cool! This looks very interesting!

------
qznc
It seems like the author considers 10K spaceships a lot. That is something
like 1MB of data. Not that much, even for a single core.

Also, there should be multiple phases: first everybody shoots, then everybody
is hit, then blast force is applied. Otherwise some spaceships get an unfair
advantage and can destroy their opponents before they have a chance to return
fire.

~~~
jpollock
It doesn't sound like a lot if it is processed linearly. However, 10,000
reader/writers on a single shared resource is a lot.

Naive implementations would have a single global lock on the resource,
resulting in lots of contention. They appear to be saying that by converting
to callbacks, they are able to determine which transactions will interfere
with each other and limit the contention to just those portions. Deadlock
would be a serious possibility here, same as in regular databases.

The size of the data is largely unimportant for this problem; it's the
contention that's the issue.

If the spaceships didn't have to interact through the global state then you
could have sharding. You can change the problem by adding "planetary systems",
and saying that ships in one system can't interact with another. Instant
shard.
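
That "instant shard" could be sketched roughly like this (all names here are
invented for illustration; this is not SpaceBase's API):

```python
# Sketch of the "planetary systems" shard: ships in one system never interact
# with another, so each system gets its own lock and its own ship table, and
# contention is confined to ships in the same system.
import threading
from collections import defaultdict

class ShardedWorld:
    def __init__(self):
        self._shards = defaultdict(dict)           # system -> {ship_id: state}
        self._locks = defaultdict(threading.Lock)  # one lock per system

    def update_ship(self, system, ship_id, state):
        # Only ships in the same system contend on this lock.
        with self._locks[system]:
            self._shards[system][ship_id] = state

    def ships_in(self, system):
        with self._locks[system]:
            return dict(self._shards[system])

world = ShardedWorld()
world.update_ship("sol", "a1", {"x": 0, "y": 1})
world.update_ship("vega", "b7", {"x": 5, "y": 5})
print(sorted(world.ships_in("sol")))  # ships never cross systems
```

The constraint, of course, is exactly the one being criticized: the
application must promise that cross-system interactions never happen.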

~~~
pron
Right. That's the whole point: shards are a solution that imposes constraints
on the application, meaning it's not really a solution.

Also, SpaceBase guarantees no deadlocks.

~~~
jpollock
According to the documentation, it does this by preventing callbacks from
initiating transactions on the same SpaceBase store?

:) :) :) Isn't that a constraint on the application? :) :) :)

I think I'm going to have to look at the demo and see how a shot which
triggers an explosion then does the query to perform the push of the
surrounding ships.

I would expect something like:

      Ship A:
         Query(origin+direction+range, callback Shot())

      Shot()
         if Exploding
           Delete
           Query(origin, range, callback Boom())

      Boom()
         Update location

Except that's not allowed?

An AOE which causes cascades of death would be cool to see too.

Nifty idea.

~~~
pron
I guess we must have missed all the places in the documentation where this is
mentioned. As you see in the code, this limitation is no longer in place.

------
threepipeproblm
"But if the new Moore’s law gives us exponential core-number growth, to
exploit those cores we need to grow the number of shards exponentially as
well, meaning we’re placing and exponential number of constraints on an
application that probably only needs to accommodate a linearly increasing
number of users, and that’s bad."

I'm no expert on parallel programming, but isn't this sloppy thinking? Like
the implication is that we must have an exponential increase in cores in order
to accommodate a linear number of users...

More generally is anything about this approach really novel? Databases can
already run transactions and queries asynchronously. Any language that
supports concurrent programming can take advantage of this, even if the
database isn't issuing a callback. And ruling out concurrent languages that
are capable of implementing this approach, but include some language features
the authors don't like seems like a side issue.

Not just trying to snark... if someone can explain what I'm missing I'm open
to it!

~~~
pron
> Like the implication is that we must have an exponential increase in cores
> in order to accommodate a linear number of users...

As the article says, you can just ignore those extra cores. But, if you want
to do cooler stuff with your data you'd better use them, and if so, sharding
is a dead-end.

> More generally is anything about this approach really novel?

I said it's a repurposing of an old idea, that of the old RDBMSs where the
application was the database. The idea is not about concurrent transactions
but about uniting the application and the database and letting the database
run and parallelize the application's logic, and be the engine of scaling. I
think most people see the database simply as a data storage component, and try
to make it fast and scalable enough. We say, let the database run the
application.

> And ruling out concurrent languages that are capable of implementing this
> approach...

I haven't ruled out any language. In fact, the demo is in Java, which
certainly doesn't qualify. Any language can do this. I only said that in order
to go forward, in order for the common programmer to take advantage of modern
hardware, we must use languages built specifically for modern hardware. I've
named two that I think fit the bill, but if another language ends up becoming
the language of the new age - it's all the same.

~~~
threepipeproblm
Thanks for your response. I think I understand your point re: cores, although
I wouldn't have used the argument quoted above, instead I would just make the
point in your response - we have all these cores, gotta use them.

Since I found some things to nitpick, I want to say I definitely like the
angle you are coming at this from -- in terms of how you apparently view
sharding, NoSQL, the database as a processing engine, etc.

But I still think you are presenting it as something novel (just look at the
title) when what you are really doing is promoting a long-established but
sorely neglected philosophy... a _good_ philosophy, which needs to be
promoted. The novel aspect seems to be supporting callbacks in your SpaceBase
product, which isn't particularly essential to this philosophy.

Thanks for the article.

~~~
pron
I think callbacks are absolutely essential. How else is the db supposed to
schedule and run your business logic?

~~~
threepipeproblm
Okay I have been thinking about your response for a while...

Database engines already possess sophisticated scheduling logic for query
processing. We are constantly taking advantage of that, in different ways,
when we delegate to the database.

And even though we are supposed to have all these business layers and whatnot,
I think centralizing logic in a DB is a pretty common practice.

So I have been asking myself, is there a significant functional difference
between the db doing a callback to the program, versus other methods of
interacting with that built-in scheduling engine?

My supposition is that in the absence of such a capability, one could always
get mostly (or fully) equivalent behavior by some combination of

(A) having the app fire queries (and consume their results) asynchronously

(B) giving queries some other means of initiating a response, e.g. writing to
a table that acts as a message queue.
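
Option (B) could be sketched with a plain table acting as the queue (the
schema and function names below are invented for illustration):

```python
# Sketch of option (B): a table acting as a message queue between the query
# side and the app side. Uses an in-memory SQLite DB for brevity.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)"
)

def enqueue(payload):
    # The "query side" writes its response into the table.
    with db:
        db.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))

def dequeue():
    """The app polls: pop the oldest message, or None if the queue is empty."""
    with db:
        row = db.execute(
            "SELECT id, payload FROM outbox ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        db.execute("DELETE FROM outbox WHERE id = ?", (row[0],))
        return row[1]

enqueue("ship a1 exploded")
enqueue("apply blast to b7")
print(dequeue())  # → ship a1 exploded
```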

To me it seems like such a transformation wouldn't even change the balance of
code between app and DB all that much; it would just change the calling
mechanism, plus maybe a couple of extra modules to manage your query
submissions and asynchronously dequeue stuff.

So I admit this qualifies as more than syntactic sugar, but I'm not sure it
goes much beyond that. Of course, many prized language features don't go
beyond that at
all; and calling/concurrency idioms are important.

I like the idea of centralizing logic in the DB and I can see how your
approach allows this in a much more elegant and straightforward manner. So I
think I get it. Whether it's a totally new architecture, I dunno...

I come out of this thinking that if there were a way to improve your article,
it would perhaps be to avoid the claim that this allows us to take advantage
of the database to "schedule and run" whatever business logic is implemented
in the db via its highly optimized query processing facilities. All programs
effectively do that in their use of the database, which is why the scheduling
capabilities existed in advance of your approach.

The core thread of what you are doing/advocating, as I understand it, is
something like this. (a) Centralize business logic in the DB. (b) Stop being
ashamed of it. (c) Support calling conventions that make it easier.

If you made the argument that way, I would have seen it as a slam dunk. And I
would expect that to be a controversial argument on HN.

Interested in your thoughts.

(BTW, I bet one could do these sorts of callbacks using the relatively recent
capability of SQL Server to run Stored Procedures written in .NET.)

~~~
pron
Yes, well, you summarize most of my points but one: the importance of
callbacks is not just in the database triggering application actions (as would
be done with asynchronous queries or the message bus you mention), but in
scheduling application code to run on appropriate CPU cores.

Our ultimate goal isn't to make a new programming model for its own aesthetic
(or non-aesthetic) sake, but in order to take advantage of multiple cores --
namely, for the sake of performance. So the database does not "control" the
application -- it actually parallelizes it.
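
A toy sketch of that idea -- callbacks touching the same data run in order,
callbacks touching disjoint data run in parallel -- might look like this (an
illustration only, not SpaceBase's actual scheduler):

```python
# Toy scheduler sketch: callbacks are tagged with the data region they touch.
# Callbacks on the same region run sequentially (they conflict); callbacks on
# disjoint regions are handed to the thread pool and may run on other cores.
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def run_callbacks(tagged_callbacks, max_workers=4):
    """tagged_callbacks: list of (region_key, callback) pairs."""
    by_region = defaultdict(list)
    for region, cb in tagged_callbacks:
        by_region[region].append(cb)

    def run_region(cbs):
        for cb in cbs:  # conflicting callbacks execute in order
            cb()

    # Disjoint regions are independent, so they run in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for cbs in by_region.values():
            pool.submit(run_region, cbs)

results = []
run_callbacks([
    ("sector-1", lambda: results.append("a")),
    ("sector-2", lambda: results.append("b")),
    ("sector-1", lambda: results.append("c")),
])
print(sorted(results))  # → ['a', 'b', 'c']
```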

And as for SQL server, yes, I guess it's a similar concept but a different
kind of database. I advocate that the database should be part of the
application; they should be sharing the same heap in the same process, and
only then should the applications let the database "drive". That is why I also
don't quite agree with your wording of "centralizing logic in the DB" because
the DB and the application are one; certainly the DB is not more (or less)
centralized than the app. Once you get that, there is no point in being
ashamed of it. Just like you let your web-app container parallelize your
presentation layer, you let the database parallelize your business logic. In
both cases the middleware shares the same process as the app, and the leap of
faith required isn't big at all.

~~~
threepipeproblm
Thanks for your response! I think it addressed my comments well.

It does make me think about the fact that _almost_ every major DB today runs
out of process and that the market seems to have selected this pretty
definitively. It seems like only recently some products are running in-process
for performance reasons. But if I understand the trade-offs here, only a
single application can run in-process with the db. Also, the db is not
protected from crashing when the app crashes. But this is the 1990s
explanation... any thoughts on how that applies to your situation?

~~~
pron
> _almost_ every major DB today runs out of process and that the market seems
> to have selected this pretty definitively.

True. That's why I've tried to show the performance to be gained by letting
the DB become integrated with the app. This is a result of the DB having just
the right kind of knowledge about the relationships among domain objects to
allow it to parallelize business logic.

> But if I understand the trade offs here, only a single application can run
> in-process with the db. Also the db is not protected from crashing when the
> app crashes.

Well, if the application and DB are one, so what? Note that we're only talking
about OLTP DBs here. I don't see much of a reason for analytics DBs to be
unified with the app. How often do you need more than one app working with the
same OLTP data?

However, if you absolutely must, there is no reason why the in-process DB
shouldn't store its data in shared memory, thus allowing other DB instances to
access the information. There are ways to ensure that if one app crashes the
memory is left in a consistent state.
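
For illustration, Python's multiprocessing.shared_memory (3.8+) shows the
mechanics; both handles live in one process here for brevity, but a second
process would attach by name the same way:

```python
# Illustration of the shared-memory idea: one "DB instance" creates a named
# shared-memory segment; another instance attaches to it by name and sees
# the same bytes.
from multiprocessing import shared_memory

writer = shared_memory.SharedMemory(create=True, size=64)
try:
    writer.buf[:5] = b"hello"  # the "DB" writes some state
    # Another DB instance attaches to the same segment by name:
    reader = shared_memory.SharedMemory(name=writer.name)
    print(bytes(reader.buf[:5]))  # → b'hello'
    reader.close()
finally:
    writer.close()
    writer.unlink()  # free the segment once all instances have detached
```

Keeping the shared data in a consistent state across a crash is the hard
part, as the comment above notes; this only shows the attach-by-name step.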

~~~
threepipeproblm
>> How often do you need more than one app working with the same OLTP data?

If you are running a modern web startup, it's probably not an issue. But in a
typical business... often IMO. And it was one of the original use cases that
drove the adoption of databases in the first place.

Would be interested in seeing you blog about the last sentence of your last
reply.

Thanks for engaging my comments.

------
SCdF
OT: your overlay that appears at the top of the screen if you scroll down far
enough means that when you click a footnote you can't actually see the
footnote without scrolling up a little.

------
notdonspaulding
I'm confused. Is the author trying to convince me to give up the expressivity,
portability, and readability of my Python app-layer codebase for the
performance gains to be had by writing my logic in stored procedures?

Having moved one codebase _OUT_ of T-SQL stored functions and onto Django's
ORM, I can say that this article does not have me convinced the way forward is
to move the code back _INTO_ the DB. If my data model is going to be dictating
my development toolset, it better be giving me some amazing performance gains
_and_ be as good as the toolset I've come to enjoy in a post-ruby-on-rails
world.

What am I missing?

~~~
pron
You don't have to give up anything. These "stored procedures" are not what you
think they are, and neither is the database. The idea is that the db and the
application become one - not that you have to program in PL/SQL or something.
You could continue to use Python or Ruby.

And yes, the performance gains are amazing.

------
cpressey
OT, but would you like some constructive criticism on usability of the
article? The lack of contrast made it difficult to read for me; clicking on a
footnote scrolls in such a way that the footnote text is hidden by the header;
and having a link that says "Discuss on Hacker News" and _also_ a comment
section under the article sends a mixed message.

------
robertfw
What are the pricing options for SpaceBase? I can't seem to find them.

~~~
dafnap
Email us at info@paralleluniverse.co and we'd be happy to give you the
details.

~~~
mikecx
Ugh, maybe I'm the only one that hates this, but if I have to spend time and
effort just to get a price for something, it's already off the table as a
choice.

This makes it seem like either the price is too high and you know it, or the
pricing scheme is too complicated and should be simplified.

~~~
pron
No, we actually kinda hate it, too :)

It's just that as a young startup, it's very important for us to learn about
our customers. At this stage, it might be beneficial for us to offer you a
fantastic deal if we think there is much to gain by the partnership. Also,
learning about the problems our customers face is more important to us than
sheer sales volume at this stage. That's why we want you to talk to us. But
we're sorry for the inconvenience.

~~~
sethrin
I work freelance. When someone wants a price from me, they want a number. It
might not be the number they want to hear. It might not be an accurate number.
A lot of the time, it won't be the actual number used, and it will probably
not have any relation to the final invoice. There are an infinity of reasons
why you might not give someone a number when they ask for a price, and they're
all irrelevant.

It's okay to qualify the number. "Ten million dollars, but order now and..."

If you'll note, your customer just said the same thing I did. Listen to him.

------
davidmr
The problem sounds not dissimilar to hydrodynamic simulations, for which
people have been using MPI for >20 years. I'm curious why you didn't use it.
Can you speak to this?

~~~
pron
Because our purpose was not to show how fast a simulation can run. We haven't
tried to come up with a new approach to simulations. We've tried to present a
new approach to databases, and show an example where a _naive implementation_
with the right database can have decent performance.

Also, the very same code would work if actions were triggered by a network
request rather than by a main loop. We don't assume anything about how the
spaceships are coordinated. If you coordinate them to run in lockstep, you
could get better performance. But, again, we've tried to show the viability of
the naive implementation, where nothing is assumed.

------
capkutay
Is "Real time spatial database" just a fancy word for another distributed, in-
memory datastore? There are many implementations of distributed data
structures, and many are used for in-memory analytics. I realize spacebase
also minimizes remote lookups/puts in the average case. I just want to
understand the "spatial" aspect they try to integrate into their product.

~~~
hendzen
The spatial aspect comes from the ability to efficiently answer queries such
as "what is the set of objects within the following cube?", etc. Look up
R-trees if you're curious.
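
In naive form that query is a linear scan over every object; an R-tree
answers it without touching most objects by pruning whole bounding boxes. A
minimal sketch of the naive version (names invented for illustration):

```python
# The cube query in naive O(n) form. An R-tree gives the same answer without
# scanning every object, by grouping objects under nested bounding boxes and
# skipping any box that doesn't intersect the query cube.
def in_cube(point, lo, hi):
    return all(l <= p <= h for p, l, h in zip(point, lo, hi))

def cube_query(objects, lo, hi):
    """objects: {name: (x, y, z)}; returns names inside the axis-aligned cube."""
    return {name for name, p in objects.items() if in_cube(p, lo, hi)}

ships = {"a1": (1, 2, 3), "b7": (9, 9, 9), "c3": (2, 2, 2)}
print(sorted(cube_query(ships, lo=(0, 0, 0), hi=(5, 5, 5))))  # → ['a1', 'c3']
```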

------
lootsauce
I'm no database expert but won't the limitations of scaling across machines
largely limit the benefits of this system?

~~~
pron
Probably not -- if you're using the right data structure, that is. I've
written about the theoretical performance here:
[http://highscalability.com/blog/2012/8/20/the-performance-of...](http://highscalability.com/blog/2012/8/20/the-performance-of-distributed-data-structures-running-on-a.html),
and pretty soon we'll publish some empirical results.

------
belorn
I wonder if this type of approach might push programming language developers
to integrate databases into the language itself, beyond just giving
programmers an interface to the database.

------
pc86
This is more a UI/theme issue than anything else, but clicking a footnote
places that citation behind the sticky header/search bar in FF17.

------
papsosouid
>First, this means using a programming language built for concurrency. At
present, there are two such languages in wide(ish) use: Erlang[2] and Clojure.

How exactly does clojure qualify while go and haskell do not?

~~~
pron
Oh, Haskell qualifies, it's just not in wide(ish) use yet. Also, I don't know
whether it supports mutable concurrent constructs (as Clojure does with refs
and STM). Without those you don't have concurrent writes.

As for Go, it allows sharing of mutable state, and so does not prevent races
and does not make it clear when modifications to said mutable state are
visible. It supports concurrency well but is not built around it.

~~~
BrokenEnso
The RedMonk ranking from September 2012 [1] seems to contradict your statement
about Haskell's prevalence. Note that it edges out Erlang on both metrics.

[1] [http://redmonk.com/sogrady/2012/09/12/language-rankings-9-12...](http://redmonk.com/sogrady/2012/09/12/language-rankings-9-12/)

~~~
fghh45sdfhr3
What is RedMonk, and how is their ranking less B.S. than rankings according to
Netcraft?
