
Building a Database System in Academia - llimllib
http://www.cs.cmu.edu/~pavlo/blog/2017/03/building-a-new-database-management-system-in-academia.html
======
dhd415
The database he's referring to is Peloton
([http://pelotondb.io](http://pelotondb.io)) and it appears that its "killer
feature" is that it's self-managing or self-tuning for highly concurrent
workloads. That's certainly an interesting idea. I wonder how closely that
kind of functionality will parallel the logic in superscalar CPUs where
instructions are run in parallel so long as those instructions can be
determined to be independent of each other. In practice, that usually works
well, but it can work even better with programs specifically designed for
superscalar CPUs. In other words, I wonder if Peloton will offer reasonable
performance improvements for concurrent workloads, but at the very high end of
the spectrum, specific design and/or tuning will still be required for maximum
performance.

The article mentions that it's currently single-node but multi-node support is
planned. If it's a fully ACID-compliant database with support for scaling
horizontally across multiple nodes, that's an area where there's plenty of
room for another database.

------
marknadal
Yes, we need this - we need more people exploring and building databases. The
more people we have tackling interesting problems, the better the solutions we
will develop.

No, building a database is not hard. We need to stop propagating this message,
because it deters newcomers from trying. And the fewer people who try, the
more of an echo chamber we get (like everybody building on top of Postgres
again and again), with no new ideas. This comment on another current HN
homepage article sums it up best:
[https://news.ycombinator.com/item?id=13933572](https://news.ycombinator.com/item?id=13933572)
.

I'll share my story here, since I seem to be an outlier:

\- In less than 2 years of building a database from scratch (without forking
another database), we had a prototype that could save 100M records for $10/day:
[https://youtu.be/x_WqBuEA7s8](https://youtu.be/x_WqBuEA7s8) .

\- In 2 years we had a system that can outperform Redis on the same test on
the same hardware:
[https://github.com/amark/gun/wiki/100000-ops-sec-in-IE6-on-2...](https://github.com/amark/gun/wiki/100000-ops-sec-in-IE6-on-2GB-Atom-CPU)

\- Within 2.5 years we've built a system that can do 1750 table inserts a
second across a federated system on low-end hardware (we hope to get this up
to 10K inserts/sec across a distributed system); a slower demo is here:
[https://www.dropbox.com/s/amjlr5gqk23et51/load.gif?dl=0](https://www.dropbox.com/s/amjlr5gqk23et51/load.gif?dl=0)

\- To do this, we even had to build a fully featured distributed "coordination
testing" framework so others can build tests that verify things work even in
the harshest of conditions (it is for doing Jepsen-like testing). Check it out
here: [https://github.com/gundb/panic-server](https://github.com/gundb/panic-server)

\- And we've managed this without locking the data to any particular model.
Graphs let you do key/value, relational, table, document, or graph type data.
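
To illustrate the last point, here is a minimal Python sketch of how a plain
graph of nodes and references can stand in for key/value, document, and
table-shaped data. It is not GUN's actual API or storage format; the `ref` and
`resolve` helpers, the node ids, and the sample data are all hypothetical.

```python
# Hypothetical illustration only: a graph stored as {node_id: {field: value}}.
# A value is either a scalar or a reference to another node: ("ref", node_id).

def ref(node_id):
    """Build a reference (edge) to another node."""
    return ("ref", node_id)

def is_ref(value):
    """True if value was produced by ref()."""
    return isinstance(value, tuple) and len(value) == 2 and value[0] == "ref"

def resolve(graph, node_id, seen=None):
    """Expand a node into a nested dict by following references (cycle-safe)."""
    seen = set() if seen is None else seen
    if node_id in seen:
        return {"#": node_id}  # already expanded on this path: keep only the id
    seen = seen | {node_id}
    out = {}
    for field, value in graph[node_id].items():
        out[field] = resolve(graph, value[1], seen) if is_ref(value) else value
    return out

graph = {
    # key/value: a node whose fields are plain scalars
    "settings": {"theme": "dark", "lang": "en"},
    # document: nesting is expressed as references to other nodes
    "alice": {"name": "Alice", "address": ref("addr1")},
    "addr1": {"city": "Stockholm", "country": "Sweden"},
    # graph: an edge from one record to another
    "bob": {"name": "Bob", "friend": ref("alice")},
    # table: a node whose fields point at row nodes
    "users": {"row1": ref("alice"), "row2": ref("bob")},
}

key_value = graph["settings"]["theme"]
document = resolve(graph, "alice")
table_rows = [resolve(graph, value[1]) for value in graph["users"].values()]
```

The only structural idea here is that every shape reduces to "nodes with
fields, some of which point at other nodes"; how a real system stores or
syncs that is a separate question.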

This story is important, because it shows what can happen when people just
try. Referencing the quote from the other thread: Ivan Sutherland replied,
"Well, I didn't know it was hard."

~~~
mamcx
Ok, but where is the material to learn from? How do you see the "baby steps"?

With compilers you have a lot to draw from, but material on building databases
is harder to find.

\--

One of the problems is that "databases" are assumed to be an ALL-OR-NOTHING
system with a lot of parts (like storage, transactions, invisible indexes,
triggers, cross-table validations, crash recovery, query optimizer), so
building one is assumed to be a huge undertaking instead of something smaller
and more contained.

\--

BTW, I'm on the hunt because I wish to build a relational language (like kdb+
but with more "normal" syntax) for building the same kind of apps as with C#,
Python, etc.

I only need in-memory processing, not storage or transactions (or maybe
transactions? I don't see how practical they would be for normal coding)

~~~
marknadal
This is a good point, there is not much good/accessible material on databases
- because a lot of that material is considered academic.

Honestly, I learned mostly by experience of what frustrated me with current
databases and then trying to come up with workaround solutions.

Then I practiced by playing around with new ideas and making small prototypes
in code. And then I would whiteboard out all the possible combinations of
those ideas.

Once I had all combinations listed, I reduced them all into their generic
rules and which ones depended upon others. Finally, based off those
constraints I worked out the mathematics of the system, the constraints, and
tradeoffs.

Maybe I should write a more intensive article on this to help other people. In
the meantime, I think if anybody can read/watch classes on data structures,
algorithms, etc. that will benefit them the most.

Recently, we have worked on some interactive animated explainers. Not quite
database stuff, more like security (cryptography) and algorithms / data
structures. Let me know if they are helpful:

Cartoon Cryptography -
[http://gun.js.org/explainers/data/security.html](http://gun.js.org/explainers/data/security.html)

Cartoon Conflict -
[http://gun.js.org/explainers/conflict/lexical.html](http://gun.js.org/explainers/conflict/lexical.html)

Cartoon Sorting Algorithms -
[http://gun.js.org/explainers/basketball/basketball.html](http://gun.js.org/explainers/basketball/basketball.html)

Cartoon Concurrency -
[http://gun.js.org/distributed/matters.html](http://gun.js.org/distributed/matters.html)

And hopefully more to come. :)

I know those are probably not immediately useful for your kdb+-like language.
But, I'd love to learn more about what you are trying to do. :D

~~~
mamcx
>Maybe I should write a more intensive article on this to help other people.

That would be great. Look, for example, at how this makes it "easy" to
understand how a (basic) GC can be _implemented_:

[http://journal.stuffwithstuff.com/2013/12/08/babys-first-gar...](http://journal.stuffwithstuff.com/2013/12/08/babys-first-garbage-collector/)

A lot of material exists on how to understand things at a conceptual level
(that is what papers are for), but the actual implementation is "left to the
reader". The trouble with that is that if I'm first learning about a subject,
and I only see the final result but not the steps to get there, I get lost in
why the final result is like that!

So, I imagine something along the lines of how a nano-database could work: one
that is only a couple of tables (arrays or hash tables), with the relational
operators built from first principles, and then maybe later "this is how you
build ACID, the query planner, etc." It isn't even necessary to cover how to
build the SQL parser, because that part is covered elsewhere (and also, I
think SQL _distracts_ from understanding this matter, because it is a
half-good and limited API for interacting with a relational database).
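
As a rough idea of what the first step of such a "nano-database" might look
like, here is a minimal Python sketch. It assumes a relation is just a list of
dicts with hashable values; the function names (`select`, `project`,
`natural_join`) and the sample tables are made up for illustration, and there
is no storage, ACID, or query planning.

```python
# A "nano-database" sketch: a relation is a list of dicts (rows).
# All function and table names here are illustrative assumptions.

def select(relation, predicate):
    """Restriction (WHERE): keep rows for which predicate(row) is true."""
    return [row for row in relation if predicate(row)]

def project(relation, columns):
    """Projection: keep only the named columns, dropping duplicate rows."""
    seen, out = set(), []
    for row in relation:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(columns, key)))
    return out

def natural_join(left, right):
    """Natural join using a hash table built on the shared column names."""
    if not left or not right:
        return []
    shared = [c for c in left[0] if c in right[0]]
    index = {}
    for r in right:
        index.setdefault(tuple(r[c] for c in shared), []).append(r)
    return [{**l, **r}
            for l in left
            for r in index.get(tuple(l[c] for c in shared), [])]

customers = [
    {"customer_id": 1, "company": "Acme", "country": "Sweden"},
    {"customer_id": 2, "company": "Globex", "country": "Norway"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "total": 250},
    {"order_id": 11, "customer_id": 1, "total": 75},
    {"order_id": 12, "customer_id": 2, "total": 40},
]

# Roughly: SELECT company, total FROM customers NATURAL JOIN orders
#          WHERE country = 'Sweden'
swedish_orders = project(
    select(natural_join(customers, orders), lambda r: r["country"] == "Sweden"),
    ["company", "total"],
)
```

Because every operator takes relations and returns a relation, they compose
freely, which is the property later pieces (indexes, a planner) would build on.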

> But, I'd love to learn more about what you are trying to do.

Most people have a pet theory that they try to push (like "make programming
pure"). I just wish to recapture the experience I had building apps with
FoxPro (which was more pleasant for database-based development than most), but
with a modern twist and without the limitations of the past.

I arrived at the conclusion that a relational language (in CONTRAST with a
relational database!) could in fact be great.

For example, in Fox you can write this:

[https://msdn.microsoft.com/en-us/library/aa978284(v=vs.71).a...](https://msdn.microsoft.com/en-us/library/aa978284\(v=vs.71\).aspx)

    
    
        SCAN FOR UPPER(country) = 'SWEDEN' AND isActive
           ? contact, company, city
        ENDSCAN
    

(This is a FOR loop specialized to walk a table, with WHERE and other stuff.
An imperative ... WHERE sql + map + project)
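
For readers who don't know FoxPro, a rough Python analogue of that loop might
look like the sketch below. The `scan` helper, the `customers` table, and its
rows are made up for illustration; FoxPro's `?` (print) is approximated with
`print`.

```python
# Rough Python analogue of the FoxPro SCAN FOR ... ENDSCAN loop above.
# The `scan` helper and the `customers` data are hypothetical.

def scan(table, for_=lambda row: True):
    """Walk `table` row by row, yielding only rows that satisfy `for_`."""
    for row in table:
        if for_(row):
            yield row

customers = [
    {"contact": "Anna", "company": "Nordic Tools", "city": "Stockholm",
     "country": "Sweden", "is_active": True},
    {"contact": "Lars", "company": "Fjord Supply", "city": "Oslo",
     "country": "Norway", "is_active": True},
    {"contact": "Erik", "company": "Old Mill", "city": "Uppsala",
     "country": "sweden", "is_active": False},
]

printed = []
for row in scan(customers,
                for_=lambda r: r["country"].upper() == "SWEDEN" and r["is_active"]):
    line = (row["contact"], row["company"], row["city"])  # the "project" step
    printed.append(line)
    print(*line)                                          # the "?" step
```

The filter (WHERE), the loop body (map), and the chosen columns (project) all
live in one imperative construct, which is the point being made about SCAN.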

An example along these lines is:

[http://www.try-alf.org/blog/2013-10-21-relations-as-first-cl...](http://www.try-alf.org/blog/2013-10-21-relations-as-first-class-citizen)

My bet is that, because the relational model is actually simple and provides
"universal" query abilities, it is possible to build a language similar in
scope to Python+pandas for building full apps, not just for the data
manipulation side of the equation in an RDBMS on the server.

Eventually I found the APL/J/kdb+ family, which has some similar ideas (only
it is based on arrays rather than on relations).

This is something I know can work. Fox was fairly popular in my time, and it
died not for lack of users but because MS killed it (along with classic VB) to
focus on .NET.

~~~
marknadal
Very much agreed, good thoughts/ideas - I'll keep these in mind! Thanks!

------
irfansharif
Andy Pavlo's course offering 15-721 Advanced Database Systems at CMU is
publicly available through recorded YouTube lectures: the spring 2017 offering
[1] and the spring 2016 one [2].

[1]:
[https://www.youtube.com/playlist?list=PLSE8ODhjZXjYgTIlqf4Dy...](https://www.youtube.com/playlist?list=PLSE8ODhjZXjYgTIlqf4Dy9KQpQ7kn1Tl0)

[2]:
[https://www.youtube.com/playlist?list=PLSE8ODhjZXjbisIGOepfn...](https://www.youtube.com/playlist?list=PLSE8ODhjZXjbisIGOepfnlbfxeH7TW-8O)

------
hackermailman
This is a good open course if you're interested in Peloton or designing your
own DBMS:
[http://15721.courses.cs.cmu.edu/spring2016/](http://15721.courses.cs.cmu.edu/spring2016/)
Other versions: [http://db.cs.cmu.edu/courses/](http://db.cs.cmu.edu/courses/)
I didn't see this linked in the article.

------
filereaper
Major upvote, it's nice to have a usable DB you can hack on, i.e. Peloton
([http://pelotondb.io](http://pelotondb.io)), and a course that backs it.

I just spent 3 hrs browsing through the listed papers on the course site.

Looking forward to being a committer...

------
skierscott
> PhD students obviously stick around for longer but they are in graduate
> school to do research.

I've spent my first two years of grad school working on NEXT [1], a system
useful for active machine learning but not a real paper producer.

I'm planning a conversation with my advisor about getting less involved in
NEXT and more involved in research very soon (hopefully in the next week).
This article validates this conversation.

[1]:[http://nextml.org](http://nextml.org)

------
danso
A little OT, but my first thought of the kind of databases that academia would
be inclined to pioneer was MIT's BayesDB, "A Bayesian database table for
querying the probable implications of data as easily as SQL databases query
the data itself":

[http://probcomp.csail.mit.edu/bayesdb/](http://probcomp.csail.mit.edu/bayesdb/)

