
The database I wish I had - todsacerdoti
https://euandre.org/2020/08/31/the-database-i-wish-i-had.html
======
jeswin
Databases we have today aren't designed for the kind of consumer applications
we use today. They are designed for banks, Amazons and their smaller
counterparts - where a central entity (the business or corp) needs full
control over the data.

The most popular consumer apps (FB, LinkedIn, blogs, photo sharing, goal
trackers, slack, basecamp etc) shouldn't be centralized in this fashion.
Data(bases) should be owned by users, and synced with friends and colleagues.
So instead of a single centralized database, we need a million tiny little
databases which are synced with peers.

Tiny databases fit well into today's hardware capabilities landscape. We have
plenty of fast storage, and powerful CPUs on computers and phones. It's time
to stop building as if the server is the only place where queries can run
efficiently. Truly serverless, that is. This also improves data security in
general; there is no central place to hack.

Happy to talk about this in detail. Please feel free to email me.

~~~
viach
> The most popular consumer apps (FB, LinkedIn, blogs, photo sharing, goal
> trackers, slack, basecamp etc) shouldn't be centralized in this fashion

The evil CAP theorem enters the town here. Consumers want things to work fast,
it doesn't fit the decentralized approach.

> Databases we have today aren't designed for the kind of consumer
> applications we use today

Consumers don't really care what kind of database you use. They want to
open the page and click the button, quick, now.

~~~
jeswin
> The evil CAP theorem enters the town here. Consumers want things to work
> fast, it doesn't fit the decentralized approach.

I think we have to distinguish 'decentralized' from 'distributed' in this
case. In a decentralized approach we reduce (by many orders of magnitude) the
size of data we're dealing with. Locally installed/browser-based applications
only have to deal with data that's created by, or directed at that user (and
locally stored).

Add: For a number of applications, hard consistency guarantees are overkill.
You could totally live with eventual consistency (some don't even need that).
In fact, other than e-com and banking I can't think of any of my daily-use
apps needing such guarantees. And indeed, decentralized data storage is a bad
fit for banking and commerce.
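
To make the eventual-consistency point concrete, here's a minimal sketch (Python; the class and names are mine, not from any particular library) of a last-write-wins register, one of the simplest CRDTs: replicas accept writes locally and converge once they exchange state, with no coordination at write time.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LWWRegister:
    """Last-write-wins register: the newest (timestamp, replica_id) wins."""
    value: str
    stamp: tuple  # (timestamp, replica_id); replica_id breaks ties

def merge(a: LWWRegister, b: LWWRegister) -> LWWRegister:
    # Merge is commutative, so replicas converge regardless of sync order.
    return a if a.stamp >= b.stamp else b

# Two devices write concurrently, then sync in either order.
phone = LWWRegister("draft", (1, "phone"))
laptop = LWWRegister("final", (2, "laptop"))
assert merge(phone, laptop) == merge(laptop, phone) == laptop
```

Banking-style invariants (e.g. no overdrafts) can't be expressed this way, which matches the point above that decentralized storage is a bad fit there.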

~~~
eternalban
Network hops cost time regardless of payload size. Distributed consensus
(which is what is required for a “decentralized” system) requires interaction
between distinct nodes.

You should note most consensus algorithm designs start with sharing a single
‘register’ (or ‘value’).

Most banks were doing “decentralized” relaxed consensus from the beginning of
“banking”. The local office accepts your deposit of a check: “funds available
next day”, because by “next day” they have confirmation from the counterparty
for the funds.

~~~
aabbcc1241
In your example, the bank system can be seen as employing an
eventual-consistency approach.

------
teleforce
I've been longing for and have a need for this type of database, and the time
is ripe now for someone to implement it. It should be local-first and
networked/distributed second, since IMHO the latter is a sync problem, not a
database problem per se.

The closest to this idea is SQLite (open source and mentioned in the article)
and SQL Anywhere from SAP (commercial and not mentioned in the article). But
as the article pointed out most of the current solutions rely on SQL.

I'd love the new database system to be open source and to use modern open
data formats, for example TileDB, Arrow, and Parquet. The computation part
should be language agnostic and independent of any single programming
language, whether something like JuliaDB or, as I'd personally prefer, the D
programming language.

Coincidentally, around seven years ago Bryan from Joyent (Joyent's
presentation is mentioned as the main motivation early in the article) blogged
about a similar effort targeting POSIX systems (specifically ZFS) that
eventually became Manta, now part of Triton a.k.a. SmartDataCenter [1]. For
implementation details, please refer to the original paper [2].

[1][http://dtrace.org/blogs/brendan/2013/06/25/manta-unix-meets-map-reduce/](http://dtrace.org/blogs/brendan/2013/06/25/manta-unix-meets-map-reduce/)

[2][https://queue.acm.org/detail.cfm?id=2645649](https://queue.acm.org/detail.cfm?id=2645649)

~~~
rapsey
SQL is way better at what it does than regular programming languages. Writing
queries with SQL is much more succinct.

~~~
willtim
That depends on the programming language; for example, C# with LINQ and
functional programming languages are pretty good at expressing queries. SQL
has no capability for abstraction, so only the simplest queries will be
succinct; if you want to e.g. union two similar queries, you'll have to copy
and paste.

~~~
hvidgaard
C# LINQ is basically a very roundabout way of doing what SQL was made for.
What SQL does, and does really well, is operate on sets of data (aka tables)
using set theory. You can get succinct expressions with C# and functional
languages, but they are not really possible to optimize the same way SQL is,
because they're not bound by set theory the same way.

~~~
Joker_vD
[citation needed].

SQL by default operates on multisets instead of sets; it has no syntax for a
set (aka table) literal (SELECT 1, "a" FROM DUAL UNION SELECT 2, "b" UNION ...
doesn't count); sub-queries (using another SELECT in a FROM clause) are a
pretty recent addition; and the list goes on. It's not based on set theory, or
on first-order logic, or the relational model, not really. If it were, it
would be Prolog.

~~~
saarni
Regarding table literals, some products support using

    VALUES (1, 'a'), (2, 'b'), ...

as a table constructor in general, not just in INSERT, and looking at SQL:92,
SQL:99, and SQL:2003 it looks to me like this should be standard SQL. Derived
tables, aka subqueries, exist in SQL:92 at least as well, so whether or not
that is considered recent depends on how you look at it, I think.

I am not trying to defend SQL with this, and all in all this does not take
away from the points you raised, but the above were something that stood out.
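
For what it's worth, SQLite is one of the products that accepts VALUES as a general table constructor; a quick check from Python's stdlib (illustrative only):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# VALUES used as a derived table, outside of any INSERT; SQLite names
# the resulting columns column1, column2, ...
rows = con.execute(
    "SELECT * FROM (VALUES (1, 'a'), (2, 'b')) ORDER BY 1"
).fetchall()
print(rows)  # [(1, 'a'), (2, 'b')]
```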

~~~
Joker_vD
They were in SQL:92, but IIRC adoption was somewhat slow and patchy, and only
generally finished sometime in the early 2000s.

My point is: the parent claims that SQL "operate[s] on sets of data (aka
tables) using set theory", and does it "really well", but when you actually
look at SQL, you realise that sets/tables aren't really first-class ― derived
tables were only added in later versions of SQL, and literal tables barely
exist ― yet those are things you'd expect from a language focused on table
manipulation. Nope, it's a language for building very specific kinds of
queries, which was then patched and extended into something more general.

------
jwr
I experimented with various approaches to client-side data handling and ended
up realizing that I do not want a client-side database at all. I'd much rather
hold data in language-native data structures (in my case,
Clojure/ClojureScript). These are very powerful and there is zero impedance
mismatch: after all, they are a part of the language.

Yes, you do need queries, but there are not many and they might as well be
coded as functions. You can do all the indexing you want, if you need to, but
JavaScript in browsers is so incredibly fast these days that even a linear
search works well for most cases.

Unless you are handling gigabytes of data client-side, of course. But that
wouldn't make much sense.

Data modification needs to go through a server API anyway, because the client
environment can't be trusted, so two-way sync would be useless. I _want_ the
server to perform data model operations while verifying everything, I do not
want data sync.
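
The comment is about Clojure, but the idea translates to any language with decent collections; a tiny Python sketch (data and names invented for illustration):

```python
# An in-memory "table" is just a list of records.
contacts = [
    {"id": 1, "name": "Ana", "city": "Oslo"},
    {"id": 2, "name": "Bo", "city": "Lima"},
    {"id": 3, "name": "Cy", "city": "Oslo"},
]

# A "query" is a plain function over the data: a linear scan.
def by_city(rows, city):
    return [r for r in rows if r["city"] == city]

# An "index" is just a dict, built only if the scan gets too slow.
by_id = {r["id"]: r for r in contacts}

assert [r["name"] for r in by_city(contacts, "Oslo")] == ["Ana", "Cy"]
assert by_id[2]["name"] == "Bo"
```

There's no impedance mismatch because the "database" and the program share one data model; the trade-off, as noted downthread, is that this stops working once the data no longer fits in memory.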

~~~
treis
>I'd much rather hold data in language-native data structures (in my case,
Clojure/ClojureScript). These are very powerful and there is zero impedance
mismatch: after all, they are a part of the language.

>Yes, you do need queries, but there are not many and they might as well be
coded as functions. You can do all the indexing you want, if you need to, but
JavaScript in browsers is so incredibly fast these days, that even a linear
search works well for most cases.

That's just a bad database. It will work for small amounts of data and simple
queries. But add in something complex or lots of data and it will fall over.

~~~
mumblemumble
In theory, if it's all in a single application's memory and using
language-native data structures, then you have a lot of power to tailor the
data model
to meet your application's needs. Barring concerns like object overhead (which
I realize is a major consideration in languages like JavaScript and Clojure),
you _should_ be able to come up with something that is more efficient than a
general-purpose solution like a database without too much trouble. It's really
just a special case of the Chuck Moore style, "I coded my own $thing in 1/10
the code and 10x the performance of the popular off-the-shelf option, in a
week. NBD, it's easy if you simply choose not to implement any of the features
you don't need." thing.

Where DBMSes truly shine is working efficiently with data that you can't or
don't want to keep in memory all at once, or for which lots of different
clients need concurrent access. The first of those (but hopefully not the
second) is absolutely a consideration for certain kinds of browser apps, but
maybe not something the parent was worrying about.

------
obi1kenobi
I have to question the "relational" part. SQL is an absolutely _massive_
language, to the point where "we support SQL" is almost an empty statement
because there's no _single_ SQL to support.

While revamping the database, consider a smaller query language. I built a
GraphQL-based database query language in my GraphQL compiler project [1], and
it's been a huge success at my company so far [2][3]. Here's a demo of
cross-database querying with it, if you're curious [4].

[1]: [https://github.com/kensho-technologies/graphql-compiler](https://github.com/kensho-technologies/graphql-compiler)

[2]: [https://blog.kensho.com/compiled-graphql-as-a-database-query-language-72e106844282](https://blog.kensho.com/compiled-graphql-as-a-database-query-language-72e106844282)

[3]: [https://blog.kensho.com/database-agnostic-querying-is-unavoidable-at-scale-18895f6df2f0](https://blog.kensho.com/database-agnostic-querying-is-unavoidable-at-scale-18895f6df2f0)

[4]: [https://github.com/obi1kenobi/graphql-compiler-cross-db-example](https://github.com/obi1kenobi/graphql-compiler-cross-db-example)

~~~
mamcx
I'm building a relational lang ([http://tablam.org](http://tablam.org)) so
this is close to my interests.

SQL is not just massive; it has grown past its actual capabilities and
original purposes without ever being fixed or cleaned up, and it ALSO needs to
be the interface for some of the most complex pieces of software (RDBMSs) and
support the needs of a multitude of features, frameworks, and languages. It's
like html/css/js.

Going back to basics, the relational model is fairly simple, and you can spice
it up with functional and/or procedural features to round it out and still
keep it very small. So yes, I think it can be done MUCH better than SQL and
still do MUCH more than it can right now...

P.S.: My language TablaM will basically be this plus a little array
programming. You can fit the relational model very well into a
lisp/tcl/datalog-like environment with a small core that is very expressive...

~~~
emmanueloga_
For the spanish speakers on this thread, the name "TablaM" reminds me of this:
[1] :-p

1:
[https://www.youtube.com/watch?v=8ktYyme_sUw](https://www.youtube.com/watch?v=8ktYyme_sUw)

~~~
mamcx
Ha!

In fact, "TablaM" came from wanting to sound like "Kablam!"
[https://www.youtube.com/watch?v=9glnVMq1xrE](https://www.youtube.com/watch?v=9glnVMq1xrE),
a weird Nickelodeon program from the past (back when it did a lot of cool
things). I have a habit of putting weird names on my projects...

Then I morphed it into TablaM, and later in the game I created a
pseudo-plausible rationalization where "tabla" (Spanish) == "table" and M is
for "multi-paradigm, -platform, -dimension, etc.", which is the explanation I
will use in case serious people ask me :)

P.S.: I have a video about this at
[https://www.youtube.com/watch?v=Qs6byLBqyR4](https://www.youtube.com/watch?v=Qs6byLBqyR4)

------
yashap
> I wonder if I’ll ever be able to get this done.

I can’t help but be extremely pessimistic. This is an insanely challenging
project. Looking at the author’s other side projects:
[https://euandre.org/about.html](https://euandre.org/about.html)

I’d guess this project is about 10,000x-100,000x as challenging as the
Songbooks one. And the author has no obvious experience writing DBs, that I
can see?

This project seems at least as ambitious as CockroachDB. Development started
on CockroachDB 6.5 years ago; it was founded by a group of Google alums with
relevant distributed-systems experience (they worked on GFS), and it has since
become well funded, with a sizeable team working on it full time for years.
It would be almost impossible to make significant progress on this project
without a strong, well-funded team working on it for years, but the author
speaks about it as a side project.

~~~
kieckerjan
What does "10,000x-100,000x as challenging" mean? If the songbook cost him one
week, this will cost him at least 200 years? I love a good hyperbole, but this
is just disparaging.

~~~
bufferoverflow
Most databases that we use are written by hundreds to thousands of people and
take years to become reliable and mostly bug-free. It can easily take much
more than 200 programmer-years.

~~~
kieckerjan
By that yardstick nobody should ever attempt to write an OS or programming
language or game engine from scratch. I am glad there are plenty of
individuals who try.

~~~
Quarrelsome
Yeah, but be realistic. If you're just one dude that has no experience in
creating operating systems then you shouldn't seriously try to write one from
scratch.

For giggles, sure, go into the project with the idea that it will fail;
that's not an issue. However, if you're serious, then what the poster is
saying is that it's over a lifetime's worth of work.

~~~
majewsky
> If you're just one dude that has no experience in creating operating systems
> then you shouldn't seriously try to write one from scratch.

In the alternate reality following your dictum, Linux does not exist.

~~~
freehunter
The Linux that was created by one person is far from the Linux that exists
today. The Linux that exists today was created by hundreds (thousands?) of
people over the span of decades.

------
allenu
I've built out a journal layer on top of SQLite for a flash card app I'm
working on. I wanted to be able to sync data across devices but I knew that
I'd always just have to handle a single-user/multi-device scenario and never
multi-user/multi-device. I was able to come up with a simple CRDT style system
where I just log the modifications made per device and share them out with
other devices via simple files (i.e. upload to central file store and download
peer journals). Each device then has to reconstruct the state of the SQLite
database entries using the shared journal files. Since it assumes single user,
I can safely use "last write wins" as the conflict resolution strategy.
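
A toy version of that reconstruction step (Python; the journal format here is invented, not the poster's actual code): merge every device's journal, order entries by timestamp, and let the last write per key win.

```python
# Each journal entry: (timestamp, device_id, key, value).
phone_journal = [
    (1, "phone", "card:7", "hola"),
    (4, "phone", "card:7", "hello"),
]
laptop_journal = [
    (2, "laptop", "card:7", "bonjour"),
    (3, "laptop", "card:9", "adios"),
]

def replay(*journals):
    """Rebuild database state from all peers' journals."""
    state = {}
    # Sorting by (timestamp, device_id) makes tie-breaking deterministic,
    # so every device reconstructs the same state (last write wins).
    for ts, device, key, value in sorted(e for j in journals for e in j):
        state[key] = value
    return state

state = replay(phone_journal, laptop_journal)
print(state)  # {'card:7': 'hello', 'card:9': 'adios'}
assert state == replay(laptop_journal, phone_journal)  # order-independent
```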

It's working reasonably well for my small use case, but there are definitely
limitations to be aware of. The big one is that, because it's journal-based,
the journal can grow and grow. Since I use simple files as the protocol for
sharing changes, this means uploading/downloading the same content more than
I'd like. If I had a stream-based server that just handled
the deltas, it would be better, but my goal with this project was to avoid
having to roll my own service. (I wanted users to provide their own cloud
storage, say iCloud or Dropbox, and the app would just use it as a central
repo, meaning no fees for maintaining the service.)

There are tricks around these limitations that I could work on if I had the
time (my main goal is to build the app, not a sync system). For example, I could
cap the journal files to a certain size and create a new one when the limit is
reached. This avoids having to upload the same contents too frequently. I
could also have a strategy where I merge all journals into a central file
every so often so that the state of the world is shared there instead of the
journals (which can then be discarded).

~~~
pdimitar
Interesting, thanks for sharing.

I'm curious as to why you didn't use sqlite's changeset / patchset API?

~~~
allenu
Interesting! I wasn’t aware of that feature. I actually built out my idea
first as its own toy database with journals and later added SQLite as the
storage mechanism.

Patchset could work for this, but I’d have to play with it a bit. Off the top
of my head, I’d need to redesign things to consider time in terms of sessions
as opposed to granular changes. i.e. today I can just append deltas and use a
timestamp to know how it fits chronologically with changes in peers. A newly
created peer could sort all deltas to play back the database from scratch.
I suppose with patchsets it could sort them all by start time and apply them
one after the other to recreate the database.

It’s an interesting problem because you can’t assume to have synced recently,
so you don’t know the baseline you are starting with, but for the application
I’m building, that may not matter.

------
mey
Immutability is an interesting property, but reality is messy. Depending on
the use case you envision for the database, data that needs to be scrubbed
will accidentally be added to the system. This could be an important private
key, or something legally dubious. In an append only/sync system there isn't
really a good way to scrub that data. (Replay the entire history, filter bad
data, then delete all remotes and resync from clean copy is not good). Again
depends entirely on the use case of the database you envision, but if you wish
to make it general purpose it is something to consider.
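
To illustrate why scrubbing is painful in such systems, here's a sketch (Python, invented structure) of the "replay and filter" option the comment dismisses: in a hash-chained append-only log, removing one entry changes every hash after it, so all replicas must be thrown away and resynced.

```python
import hashlib

def chain(entries):
    """Build a hash-chained append-only log: each hash covers all history."""
    h, log = "genesis", []
    for entry in entries:
        h = hashlib.sha256((h + entry).encode()).hexdigest()
        log.append((entry, h))
    return log

log = chain(["add:doc1", "add:secret-key", "add:doc2"])

# Scrub by filtered replay: drop the bad entry and rebuild the chain.
clean = chain([entry for entry, _ in log if entry != "add:secret-key"])

assert [entry for entry, _ in clean] == ["add:doc1", "add:doc2"]
# Every hash after the removed entry changes, which is exactly why all
# remotes have to be discarded and resynced from the clean copy.
assert clean[1][1] != log[2][1]
```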

~~~
Spooky23
It’s interesting because it reflects the reality of modern computing — for
many use cases storage capacity is not a meaningful constraint.

I ran pretty important systems as a DBA years ago that were significant but
would never, ever hit a meaningful constraint on growth. (Performance,
different story.)

Sometimes looking at a problem like a database with a new set of constraints
leads to interesting and novel things.

~~~
foo_barrio
I've been working on personal and professional projects where growth isn't a
truly restricting constraint but is still an actual constraint. What do you
mean by "meaningful"? Once I get into the tens of terabytes, there's no
storage system where copying a TiB of data is "instant" or ignorable.

Professionally, I work on a financial reporting system with around 1.5 million
accounts with trading activity. Things like moving a month of data around and
checksumming still take on the order of an hour. If the size of that monthly
dataset doubled, we'd be okay, but if it grew 10x we'd have to start
restructuring things on both the software side and the hardware side.

For personal work, I am searching for video frame substrings, regardless of
codec, resolution, etc., within a ~50 TiB video set. Having to re-encode
hundreds of GiB of video at a time takes a huge amount of time and is not
something I can just "throw into an S3 bucket". The growth of said working set
is not truly constrained but is actually constrained. I can only afford so
many 8-10 TiB HDDs at a time! They also fail and have to be mirrored/backed up
with rsync at the very least, which adds to the cost for me.

~~~
Spooky23
I mean that it's not a technical constraint or limit. Not that it's trivial
to do! If there is an ROI in keeping stuff forever, you can.

Think about what you’re saying about your personal project. What you’re
talking about wasn’t financially achievable for TV companies just 20 years
ago. Hell, anything older than a few days would be stored offline in some tape
silo, or in a shelf. (And 10-12 people to attend to it!) Now you as an
individual _could_ afford to do all sorts of stuff with that video data,
albeit for some significant $.

------
cafxx
> SQLite [...] assumes a POSIX filesystem that would have to be emulated.

No it does not. It has a pluggable storage interface that makes no such
assumption: [https://www.sqlite.org/vfs.html](https://www.sqlite.org/vfs.html)

~~~
EuAndreh
Thanks for the correction, I've amended the text.

[https://euandre.org/2020/08/31/the-database-i-wish-i-had.html#fn:posix-sqlite](https://euandre.org/2020/08/31/the-database-i-wish-i-had.html#fn:posix-sqlite)

------
l0b0
An immutable database (like Git's index) sounds like an especially powerful
concept. No more need for modification dates (they would be in the change
record) and no need at all for separate log tables and all that guff. Every
CRUD system ends up with some sort of audit log tacked on, costing heaps in
terms of maintenance, storage, and extra processing for every change. In
addition, I would want a command analogous to `vacuum` to truncate this
history while the database is online, similar to Git's rebase or
filter-branch.

------
jnwatson
I already got one? My employer mostly has these requirements. We ended up
writing our own query language because it is a graph DB, but all the bits are
there.

LMDB + a journal + a query interpreter is really all you need. We don't have
off-line mirroring yet, but that's on the list.

You could start with Synapse [*] and rip out the bits you don't need. It is
open source.

[*]
[https://github.com/vertexproject/synapse](https://github.com/vertexproject/synapse)

------
virtualwhys
> What I miss from it that SQLite doesn’t provide is the ability to run it on
> the browser.

Cries in abandoned Web SQL[0]

[0]
[https://en.m.wikipedia.org/wiki/Web_SQL_Database](https://en.m.wikipedia.org/wiki/Web_SQL_Database)

~~~
CodesInChaos
IMO compiling sqlite to webassembly is a better approach than declaring a
specific implementation a standard and integrating it in browsers.

~~~
osmarks
I don't think browsers have a good API it could store to, though. As far as I
know, there isn't really anything for efficiently storing large binary blobs
that can be updated and read in place.
~~~
CodesInChaos
What about IndexedDB?

A basic implementation would use one document per page. An advanced
integration might replace the whole sqlite storage engine with one based on
IndexedDB.

~~~
osmarks
That could work, I forgot SQLite worked by pages. It's still probably not
ideal, though.

------
legg0myegg0
I think DuckDB checks a number of these boxes! It is embedded and written in C
so compilable to WASM. It is also 10x faster than SQLite and interoperable
with Apache Arrow! It might be a good place to start anyway!

------
andriosr
I think it's crazy that database updates are still acceptable in the design
of systems and databases. Updates are deletes in practice: you lose what was
there before.

Getting to know how Git and blockchains in general work will help in your
journey. I used these concepts in the design of an immutable ledger system,
and the result was nice:
[https://decimals.substack.com/p/things-i-wish-i-knew-before-building](https://decimals.substack.com/p/things-i-wish-i-knew-before-building)

~~~
dan-robertson
GDPR requires you to delete the information you have about people under some
common circumstances, if they ask you to.

~~~
andriosr
Sure. Immutable doesn't mean you can't delete things:

> True. And what makes it easy is its immutability. When you change history
> in Git, you rewrite it (rebase). But re-writing is only possible if you know
> every bit of the journey up to now. When you have actual updates, history is
> lost.

[https://news.ycombinator.com/item?id=24340659](https://news.ycombinator.com/item?id=24340659)

------
hden
Datahike is an immutable Datalog database (like Datomic) powered by the
hitchhiker-tree.

As of v0.6.0 it uses hitchhiker-tree synchronization for replication.

[https://github.com/replikativ/datahike](https://github.com/replikativ/datahike)

~~~
knubie
To clarify: it is planned to use hitchhiker tree synchronization in v0.6.0.
The latest release as of now is v0.3.1.

------
0xCMP
Also have wanted this. Operating on data locally means it can be fast and low
latency (on device, not between devices).

I think the secret lies in using CRDTs to sync data in a way that allows it to
be processed into another system. E.g. Postgres on a server and SQLite on a
phone.

~~~
mbalex99
You should definitely take a look at what we’re doing at www.ditto.live

------
spankalee
Sounds very much like noms:
[https://github.com/attic-labs/noms](https://github.com/attic-labs/noms)

Not sure what the status of that is since Attic Labs was acquired by
Salesforce.

~~~
tantalic
Noms has gone pretty quiet since the acquisition. Dolt is based on noms (not
sure how much it has diverged) and is much more active:
[https://github.com/liquidata-inc/dolt](https://github.com/liquidata-inc/dolt)

~~~
aboodman
FYI, some of us have started working on
[https://replicache.dev/](https://replicache.dev/) instead, which shares some
DNA with Noms. But it's not decentralized.

------
sharabara
Upvote for Syncable & Embeddable. It will take a massive amount of time to
build from the ground up, so I wonder if it's sensible for you to build SQL on
top of RocksDB like CockroachDB did:
[https://www.cockroachlabs.com/blog/cockroachdb-on-rocksd/](https://www.cockroachlabs.com/blog/cockroachdb-on-rocksd/)

------
naranha
Aren't relational AND syncable very difficult to combine? Foreign key
references, for example, may not yet exist on one instance when they are
synced. PouchDB/CouchDB and Git support multi-master sync, which is great, but
they are not relational.

The workaround is to detect foreign key violations in your app with indexes or
a job that runs on the nodes.
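
That workaround could be as simple as an integrity job that runs after each sync; a sketch in Python (table and column names are made up):

```python
# Rows that arrived via sync; a referenced customer may not be here yet.
customers = [{"id": 1}, {"id": 3}]
orders = [
    {"id": 10, "customer_id": 1},
    {"id": 11, "customer_id": 2},  # dangling: customer 2 hasn't synced yet
]

def dangling_refs(children, parents, fk, pk="id"):
    """Find child rows whose foreign key matches no parent row."""
    known = {p[pk] for p in parents}  # an index over the parent keys
    return [c for c in children if c[fk] not in known]

# Flag (or quarantine) these until the missing rows arrive from a peer.
print(dangling_refs(orders, customers, "customer_id"))
# [{'id': 11, 'customer_id': 2}]
```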

~~~
sradman
Log shipping works well for single master scenarios and these solutions can be
combined with a partitioned primary key space for inserting new rows in the
remote databases.

A two-way embedded/server relational sync is a much more difficult problem but
it has been solved before. SQL Anywhere is a commercial relational database
that has exceptional synchronization capabilities for fleets of mobile
devices. Its MobiLink gateway synchronizes with heterogeneous DBMS servers
like Oracle.

A new relational sync project should become familiar with MobiLink, in my
opinion.

------
ultrasurf
I think what's even better than relational may be directly using OOP objects,
using the system prevalence pattern.

If you're willing to model your "DB updates" using event sourcing, you could
fork something like Prevayler to achieve your immutability needs while being
able to serve as source of truth across multiple nodes.

~~~
eyelidlessness
OOP isn't immutable

~~~
ultrasurf
Technically, sure, but you can always deep-copy/serialize/etc. to make it
immutable.

My point is that you can achieve all the OP's needs without representing your
model in SQL.

~~~
sooheon
SQL was not one of OP's requirements.

------
mikealrogersHN
I have some good news for you :)

Hi, I’m Mikeal, I created PouchDB. After a year or so I handed that project
off to some great maintainers that made it much better as I had grown a little
skeptical of the replication model and wanted to pursue some alternatives.

It’s been about 10 years, much longer than I thought it would take, but I have
a young project that finally realizes the ideas I had back then.

Sometime after PouchDB was created I realized that it just wasn’t going to
work to replicate a whole database to every client. In fact, even a view of
the database wasn’t going to work, because the developer isn’t in a position
to really understand the replication profile of every user on all of their
devices, you need a model that has partial, or more accurately “selective”
replication based on what the application accesses in real time.

I became convinced that the right primitives were already present in git:
merkle trees. Unfortunately, git did a very poor job of surfacing those
primitives for general use and I wasn’t having much luck finding the right
approach myself.

Shortly after joining Protocol Labs I realized they had already figured this
out in a project called IPLD. Not long after that, I started leading the IPLD
project/team and then putting together my ideal database whenever I found a
free moment.

It’s very young, lots of missing features, still working on some better data-
structures for indexing, but it is very much a database that replicates the
way git does and approaches indexing over a primary store the way CouchDB
does, but there’s a lot more too.

With these primitives we can easily nest databases inside of other databases
(and create unified indexes over them) and we can easily extend the data types
in the database to user provided types. Using some of these features it
already supports streams of binary data, databases in databases, and linking
between pieces of data.

Let me know what you think, there’s still plenty of work to do, but it has
100% code coverage, so it does what it says it does :)
[https://github.com/mikeal/dagdb](https://github.com/mikeal/dagdb)

------
sroussey
Gotta figure out what your needs are. For example, if browser and node is a
requirement, then you can look at RxDB, which uses Pouch and either IndexedDB
or various file options.

If you want everything, it’s going to be big and not appropriate for web or
mobile unless part of browsers or the OS.

~~~
dsun175
Also I see a big conflict in wanting relational data and a fast/secure
replication protocol.

------
silasdavis
Reminds me of
[https://github.com/davebryson/bftdb](https://github.com/davebryson/bftdb):
sqlite replicated by Tendermint (PBFT-ish consensus)

------
rbjorklin
It sounds like Irmin combined with the GraphQL bindings is what he wants:
[https://irmin.org/](https://irmin.org/)

~~~
dwenzek
Indeed, Irmin is a library to implement data stores which can be freely
cloned, updated and merged along a model similar to git.

It separates the content from the actual store, so it can run on a
file-system, on a git repository, in memory, or in the browser (see for
instance
[https://github.com/talex5/irmin-js](https://github.com/talex5/irmin-js)).

Like git, Irmin uses 3-way merge, but applied to arbitrary values and not
only files, as described here
[http://gazagnaire.org/pub/FGM15.pdf](http://gazagnaire.org/pub/FGM15.pdf) and
generalized here
[https://kcsrk.info/papers/oopsla19-mrdt.pdf](https://kcsrk.info/papers/oopsla19-mrdt.pdf)

------
brynb
Thought I'd share a project I've been working on over the past year that meets
at least a subset of the criteria from the article:

[https://github.com/brynbellomy/redwood](https://github.com/brynbellomy/redwood)

Redwood is a decentralized database providing a data model that's much like a
Redux state tree. One of the fundamental design goals was to better harmonize
how data is handled on frontend and backend layers.

It's also:

- syncable/offline-capable, thanks to a CRDT system

- able to serve as a standalone Git backend

- embeddable into Go applications

- immutable, in a blockchain-y sense, due to its append-only ledger

There are some demos in the Github repo:

- A real-time collaborative text editor

- A chat application

- Git integration

- A video streaming application (still improving performance here, but it's
functional)

It's pretty alpha still, but feedback and contributions from anyone interested
would be very welcome.

------
thomasfromcdnjs
I think the more important question he didn't answer is _why_ he wants a
database like that.

------
bambam24
I have been there. Over-engineering kills startups.

------
Phil-bitplex
It's a super interesting set of requirements - and I've been mentally
wrestling with an extra one on top of those: selectively syncing data.

Let's say I've got three users of my bookkeeping app. Person A and Person B
are in different companies, totally separate data. Person C is a bookkeeper
and wants to work with both their sets of data.

It's a trivial example where you can just say "well you just keep Person A's
and Person B's data separate and switch what the bookkeeper gets to see when
they want to reconcile separate accounts".

I don't have any good solutions for it yet, but I'm enjoying exploring them.

------
arxpoetica
There's some cross over (not 100%) with an idea I've felt is needed in the
data marketplace. See links:

[1] [https://github.com/datasynq/synq](https://github.com/datasynq/synq)
[2] [https://arxpoetica.com/articles/2020/07/20/why-data-needs-a-standard](https://arxpoetica.com/articles/2020/07/20/why-data-needs-a-standard)

------
garfieldnate
Yes please! I am all for anything that helps users own their own data. See
also Solid ([https://solid.mit.edu/](https://solid.mit.edu/)) and "local-first
software"/CRDTs ([https://www.inkandswitch.com/local-first.html](https://www.inkandswitch.com/local-first.html)).

------
jrockway
I just want a database that takes no configuration options except for a list
of peers and a disk partition to write to and provides strictly serializable
transactions 100% of the time (modulo partitions).

I'd rather start with slow and correct, and implement hacks to improve
performance once slow becomes "too slow". But instead, database engines are
sold on benchmark performance, so out of the box they lose transactions they
told you they committed, or read uncommitted writes, or all sorts of other
crazy things. I don't want this unless I write code myself to do those things.
(If I go through the API, it must be completely consistent. If I clone the
drive of the database server and pick around at random disk blocks, sure,
maybe I'll see some uncommitted writes.)

~~~
implfuture
Check out FoundationDB! It’s a distributed key value store with strictly
serializable txns by default.

~~~
jrockway
I really like FoundationDB. It still feels like that system where if it breaks
you will be blamed aggressively for picking a weird thing, so I'd like to see
some more success stories, I guess. But if I really care, maybe I should be
the success story.

The Go API is also kind of weird, and conflicts with my general thought on Go
code that every I/O operation should be cancelable with a context -- they
really don't subscribe to that point of view.

Those are the main reasons why I don't use it... but I guess they have the
fundamentals down (the "foundation", if you will), so it's probably worth
powering through those nits and starting to use it.

------
rodmena
I am looking for a hero to test that 280 TB limit of SQLite.

------
nanomonkey
Sounds like Hypercore (DAT protocol) or Secure Scuttlebutt.

~~~
sbazerque
Yeah, except Hypercore or SSB sync'ed data types are not as flexible as the
relational model.

I'm building something like this, with one extra requirement: permissionless /
trustless p2p operation. It works in the browser, here's the README:

[https://github.com/hyperhyperspace/hyperhyperspace-core](https://github.com/hyperhyperspace/hyperhyperspace-core)

~~~
nanomonkey
I'm not sure what you mean by not as flexible as the relational model. I'll
address SSB, since I'm more familiar with it, having built a CRDT system on
top of it in ClojureScript.

Each message can be linked by its content addressable hash. The engine allows
for Map/Filter/Reduce queries which are not much different in capabilities
from SQL queries; these can even be indexed to speed up views. The only real
difference is that no schema is enforced on each message type, but it's not
that difficult to enforce something like that on a higher layer when
publishing entries.
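
The Map/Filter/Reduce style of querying a message log can be sketched like this (a toy in-memory log in Python; the field names are illustrative, not SSB's actual schema or engine):

```python
from functools import reduce

# An append-only log of JSON-like messages, loosely in the shape of
# feed entries: a type tag plus arbitrary per-type content.
log = [
    {"type": "post", "author": "alice", "text": "hello"},
    {"type": "vote", "author": "bob", "value": 1},
    {"type": "post", "author": "bob", "text": "hi"},
    {"type": "vote", "author": "alice", "value": 1},
]

# The SQL query "SELECT sum(value) FROM log WHERE type = 'vote'"
# expressed as a filter followed by a reduce:
votes = filter(lambda m: m["type"] == "vote", log)
total = reduce(lambda acc, m: acc + m["value"], votes, 0)
# total == 2
```

An engine can materialize and index the output of such a pipeline as a "view", which is what makes repeated queries fast despite the schemaless log underneath.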

If you think about it, storing JSON entries is a lot more flexible than table
rows. Also, the content stored need not be JSON; it can be EDN, Transit, or
references to binary blobs.

~~~
sbazerque
> Each message can be linked by its content addressable hash.

What I had in mind is that, while (as you well said) you can link the data in
a pub any way you need, the distribution model is fixed. The only way to
change that is to create more pubs.

I think the distribution model is an integral part of the data structure in a
general p2p setting and needs to be customizable, within reason.

Cheers!

~~~
nanomonkey
Thanks for your response, I find this stuff super interesting to watch evolve.

I think I understand a bit of your concern, although note that the pubs aren't
necessarily required for replication in SSB, one can use onion routing,
physical devices (sneaker nets), local area connections, etc. But I agree,
that there are more areas to improve, such as changing the
network/capabilities keys on the fly from within a message to ratchet the
distribution, making reference to content hashes within different network
keys, etc. I'm curious if your work takes any of this into account? Thanks
again.

------
andi999
Maybe one can learn from and avoid the biggest mistake of SQL: commands and
data mixed in the same string. Maybe one can separate the command from the
data in the protocol itself.
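
Separating the command from the data is exactly what parameterized queries do. A minimal sketch with Python's built-in sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))

# The command is fixed text; the data is bound separately, so a
# malicious value cannot change the shape of the statement.
evil = "alice' OR '1'='1"
rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (evil,)
).fetchall()
# rows == []  -- the payload is treated as a literal string, not as SQL.
```

A protocol that only ever transmits statements and parameters as distinct fields would make the string-concatenation mistake impossible rather than merely avoidable.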

------
bionhoward
I totally agree and love this list, but also wish for graph queries like
OpenCypher. The graph community really needs something like that because right
now it’s all centralized, expensive and proprietary, or tied to the JVM. If
SQL standards evolved to include a MATCH statement and ASCII-art syntax for
traversals, it would make JOINs so much easier.

------
asadovsky
Take a look at Syncbase
([https://vanadium.github.io/syncbase/](https://vanadium.github.io/syncbase/)),
it has some of these characteristics including secure p2p sync and a SQL-like
query language. It's not under active development but might be a decent
starting point to pick up from.

------
chromedev
Have you ever used a flat file database? I envision the perfect database
having a file system API that automatically sorts and can be edited using
standard files, but stores data in a format like JSON or YAML. That way it
can be versioned and queried using standard file system tools, or tools that
follow the Unix philosophy.
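
A minimal sketch of the flat-file idea: one JSON file per record, named by key, so ordinary tools (ls, grep, git, rsync) can version and inspect the data. This is illustrative only; real uniqueness or foreign-key constraints would need extra machinery:

```python
import json
import pathlib
import tempfile

# A throwaway directory acts as the "database".
db = pathlib.Path(tempfile.mkdtemp())

def put(key: str, record: dict) -> None:
    # One file per record; sort_keys gives stable, diff-friendly output.
    (db / f"{key}.json").write_text(json.dumps(record, sort_keys=True))

def scan(predicate):
    # A full scan stands in for a query; sorting gives deterministic order.
    records = [json.loads(p.read_text()) for p in sorted(db.glob("*.json"))]
    return [r for r in records if predicate(r)]

put("1", {"name": "ada", "age": 36})
put("2", {"name": "grace", "age": 45})
over_40 = scan(lambda r: r["age"] > 40)
# over_40 == [{"age": 45, "name": "grace"}]
```

The full-scan `scan` is also where this design hits the wall the replies below describe: without indexes, every query touches every file.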

~~~
daxfohl
I tried this for a small project once about 10 years ago, using svn as version
control and putting a GUI around it. It was clunky, with no good way of doing
uniqueness or foreign key constraints, and slowed way down after ~1000
records. I ended up moving to a regular database and implemented the
versioning logic manually, which was pretty straightforward.

At first this irked me because the final solution didn't retain the full power
of a VCS. No merges, no branches, etc. But what I finally came to terms with
was that it didn't need to (YAGNI): it just needed historical records and that
was pretty easy. It ended up being much cleaner and faster that way.

~~~
wccrawford
I think that you'd have to also implement indexes to make that work, just like
SQL databases do. Even with SQL, I ran into plenty of cases where indexes were
_necessary_ for decent speed. Flat files would just amplify that.

And, if you're trying to use the VCS to do merges, the indexes will constantly
get messed up, so they'd have to be reindexed after every merge to be sure
they were legit, and it still wouldn't ensure uniqueness... Ouch.

------
pjmlp
I have yet to see a GB/TB-scale project that cannot be handled by a
traditional RDBMS, so I don't wish for new databases, except when I have to
work without either Oracle or SQL Server and miss their productivity tooling.

------
juancampa
Sled might be a useful library to build your DB on:
[https://github.com/spacejam/sled](https://github.com/spacejam/sled)

------
xwdv
I just want a database that implements a soft delete function natively.

------
PaulHoule
What you need is a "database construction kit".

------
DominikD
With those priorities in mind I'd probably start with RocksDB and build a SQL
engine on top. I'm really impressed with RocksDB's performance, space
requirements, and feature set.

------
afinlayson
I've always wanted a git based database.

------
Shelnutt2
Disclaimer: I'm a member of the TileDB team.

> embedded, immutable, syncable relational database.

TileDB Embedded ([https://github.com/TileDB-Inc/TileDB](https://github.com/TileDB-Inc/TileDB))
checks all four of these boxes. It is a universal storage engine based on
dense and sparse multi-dimensional arrays. I explain each point below.

> embedded

TileDB Embedded [1] is an open source (MIT licensed) embeddable C++ library
which exposes C and C++ APIs. We also built APIs for Python, R, C++, Java and
Go, and integrations into MariaDB, PrestoDB, Spark, GDAL, PDAL and more. Via
MariaDB we even have embeddable SQL[2]. TileDB abstracts the storage backends
behind an extendible VFS class, and currently supports S3, GCS, Azure, HDFS,
and local disk.

> immutable

TileDB is designed around immutable objects. Every write creates a new
"fragment" using an MVCC model[3]. No file is ever updated in place. TileDB
is designed to naturally handle the eventual consistency and constraints of
cloud object stores. This allows for features like multi-reader/multi-writer
support, time traveling, and update support all within the storage engine,
without any external orchestration.

> syncable

TileDB's MVCC approach to handling the eventual consistency of cloud object
stores translates directly into its syncability. Every write operation in
TileDB creates a new "fragment"[3]. A fragment is an immutable folder (or
prefix on object stores) that contains all the data from a single write
session. Fragments are created with a timestamp and a UUID to ensure
uniqueness. A fragment is ignored until a special `.ok` file appears in the
listing (each write is atomic). Incomplete fragments (without the `.ok` file)
are gracefully ignored.

This means that every fragment is self-contained and syncable. The only
requirement is that the special `.ok` file must show up only after the
complete fragment folder is synced. With cloud object stores this is not a
problem, given their read-after-write consistency guarantees. For other
systems, the synchronization can be managed to provide this behavior.
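
The fragment-plus-`.ok`-marker protocol can be sketched in a few lines. This is a hypothetical file layout for illustration; TileDB's real fragment format and naming scheme are richer:

```python
import pathlib
import tempfile
import time
import uuid

# A throwaway directory stands in for the array's storage prefix.
store = pathlib.Path(tempfile.mkdtemp())

def write_fragment(data: bytes) -> str:
    # Timestamp + UUID gives each fragment a unique, roughly sortable name.
    name = f"__{int(time.time())}_{uuid.uuid4().hex}"
    frag = store / name
    frag.mkdir()
    (frag / "data.bin").write_bytes(data)
    # The .ok marker is created only after the fragment is fully written,
    # so readers never observe a half-finished write.
    (store / f"{name}.ok").touch()
    return name

def committed_fragments() -> list:
    # Readers ignore any fragment directory lacking its .ok marker,
    # which makes partially synced fragments harmless.
    return sorted(p.name for p in store.iterdir()
                  if p.is_dir() and (store / (p.name + ".ok")).exists())

frag = write_fragment(b"hello")
```

The ordering requirement mentioned above maps onto this sketch directly: as long as the marker file is the last thing to appear, each write is atomic from a reader's point of view.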

> relational database

Tables can be easily modeled as multi-dimensional sparse arrays[4]. Through
TileDB's integrations with MariaDB and PrestoDB (and Spark), you gain a full
SQL interface, while being able to query the data directly via the language
APIs without using SQL. For more robust features, such as foreign key
enforcement, it is easy to use embeddable MariaDB to achieve this, while we
work to push such features into the storage engine itself. We are always
seeking feedback on which features are most used and requested, so we can
push down common operations into the storage engine.

[1] [https://tiledb.com/embedded](https://tiledb.com/embedded)

[2] [https://docs.tiledb.com/main/api-usage/embedded-sql](https://docs.tiledb.com/main/api-usage/embedded-sql)

[3] [https://docs.tiledb.com/main/basic-concepts/physical-storage](https://docs.tiledb.com/main/basic-concepts/physical-storage)

[4] [https://docs.tiledb.com/main/handling-dataframes](https://docs.tiledb.com/main/handling-dataframes)

------
v3gas
Love this! Rooting for you!

------
winrid
Would firebase work?

~~~
buzzerbetrayed
Firebase (whether it be Firestore or realtime database) has almost none of the
desired properties in the article. The only one it does have, as far as I can
tell, is "syncable".

~~~
winrid
Yeah, just curious what they actually want to do with it that requires these
properties.

------
miki123211
I believe decentralization is a cool concept from an engineering perspective,
but it creates more problems than it solves. The current model of big,
centralized services is kinda like democracy. Bad in a lot of ways, but
everything else has even bigger problems, so that's what we're stuck with.

First, it's very hard for end users to actually connect and discover each
other. NATs are everywhere, so direct connections are often not possible,
and introducing a central server defeats the point. Most non-techies spend a
majority of their time on mobile, so having connections in the background is
also not an option. Everyone having their own servers is not realistic either.
It has been proven time and time again that, if people can choose between
paying or getting something for free, they will choose the latter. Whether the
ads are personalized or not (what some people call tracking) doesn't matter
that much.

Monetization is hard with decentralized services, and money is very often
required to make something with good enough UX for most people. If you don't
collect a lot of data, you make ad fraud easier, so ads are not an option.
People don't like to pay (see above) and payments are hard to enforce if
you're not managing the service, so charging is not an option either.

Making decentralized services is much harder than making centralized ones.
It's easier to pretend everything in the world happens in order and accept
events one by one, than to deal with the messy reality of thousands of devices
doing thousands of things at once, without a coordinating entity. Imagine one
user kicking everyone out of a group while, in other parts of the network, he
himself is being kicked out. What should happen when the partitions rejoin?
Who should decide? Timestamps can't be trusted either, as they can be faked.
Now
you're having consensus problems, dealing with byzantine fault tolerance and
inventing a blockchain. A central, trusted server is easier to write and less
prone to security exploits.

Storing everything on your own device isn't great either. If you have 100
friends on Facebook, and each of them has 1000 photos, you don't have enough
space to store them all. Not storing them isn't an option either, as you might
actually want some of them at some point. Asking the original poster doesn't
work, as he might not be in the app at that moment and is therefore
disconnected from the network. With something like a decentralized Amazon,
you need search, and
good, privacy-friendly, decentralized search is somewhere between hard and
impossible.

If you can't have captchas, rate limits and a lot of data about your users,
it's hard to prevent spam, spreading fake news, attacks, catfishing and
similar behavior. Restricting your network to friends works, as long as you
accept that no stranger (e.g. at your new place of work) can ever find or
contact you.

Even if you manage to overcome all those problems, there's still a pretty high
risk that someone might do an embrace, extend, extinguish on your network with
their proprietary app, kind of like Gmail is doing with email.

I want decentralization to work and Facebook to be gone just like everyone
else, but, as sad as that is, it never will.

------
RedShift1
Ah yes immutability. Until GDPR hits you :')

~~~
dluan
What are the concerns with GDPR that a user-owned data approach like
Scuttlebutt wouldn't cover?

~~~
PeterisP
I'm not sure what exactly Scuttlebutt covers or does not cover, but an example
real problematic situation that I have seen is like this:

1\. The system among other things stores user-entered content;

2\. User A has published sensitive personal data about person B (perhaps in a
doxxing-like post, perhaps by making a fake profile with their name and data,
perhaps including interesting images), and to make things interesting also
regarding other laws in other countries, let's assume that person B is
underage (which is also realistic and common; school kids do things like that
to one another).

3\. Person B requests that user A removes that content, and A refuses.

4\. Person B escalates (the specific process does not matter much) finally
resulting in you getting a legally binding request to remove that information
about person B from your system/product.

So the key tech requirements include both breaking immutability and full
centralised control; you can do decentralised systems for e.g. performance and
bandwidth reasons, but fully decentralised, user-controlled censorship-
resistant technologies can't be used because then you can't implement
censorship, and you need that ability.

~~~
RedShift1
There doesn't even have to be a person B. Under the terms of the GDPR you have
the right to an erasure request, which means the party holding your data must
erase your data, whatever that may be, and it must be erased from backup
systems too.

------
dgrin91
> An embedded, immutable, syncable relational database.

Sounds like he wants a blockchain?

A blockchain probably isn't really what OP wants. But funnily enough it does
meet all those criteria.

> what I’m after is a database to be embedded on client-side applications
> itself, be it desktop, browser, mobile, etc. That's how most Blockchain
> clients work - on the client side, with APIs (usually RPCs) to chat with
> other programs

> Being immutable means that only new information is added, no in-place update
> ever happens, and nothing is ever deleted.

Immutability is exactly what blockchains were designed for.
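
The append-only, tamper-evident property can be illustrated with a minimal hash-chained log. This is a sketch of the general idea, not any particular blockchain client:

```python
import hashlib
import json

# Each entry commits to the previous entry's hash, so history cannot
# be rewritten without detection. This captures the immutability
# property without any of blockchain's consensus machinery.
log = []

def entry_hash(prev: str, payload: dict) -> str:
    body = json.dumps({"prev": prev, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode()).hexdigest()

def append(payload: dict) -> None:
    prev = log[-1]["hash"] if log else "0" * 64
    log.append({"prev": prev, "payload": payload,
                "hash": entry_hash(prev, payload)})

def verify() -> bool:
    # Walk the chain, recomputing each hash from its contents.
    prev = "0" * 64
    for e in log:
        if e["prev"] != prev or entry_hash(e["prev"], e["payload"]) != e["hash"]:
            return False
        prev = e["hash"]
    return True

append({"op": "insert", "row": 1})
append({"op": "insert", "row": 2})
ok_before = verify()            # True: the chain is intact
log[0]["payload"]["row"] = 99   # tamper with history in place...
ok_after = verify()             # False: the chain detects it
```

Note that this gives verifiable immutability only; the consensus problems (who may append, and in what order) are the part that makes real blockchains heavyweight.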

> Syncable

Since blockchains are decentralized by nature, this also comes out of the box.

Of course if you choose a blockchain you have to deal with a host of other
issues: performance, no SQL (in most clients), harder to debug, etc. But it's
still a fun thought experiment.

~~~
haggy
How is blockchain relational?

~~~
xvolter
I do not think it is.

Blockchains are meant as a data store. Archival, verifiable. But not easily
searchable. There is no provision in blockchain designs for advanced query
languages/logic.

It seems like the OP is going after something like blockchain that adds
querying. But I don't think blockchain syncing is designed well enough for
client-side use, and the OP seems to want to use this database client-side in
JavaScript. I think it would be possible to implement blockchain syncing with
WebRTC, but WebRTC has some security risks for users (IP leakage mostly) so I
am not sure I would start off with that as a primary communication channel.

