
Ask HN: Best stack for local data with remote long term storage - aetherspawn
Hi,

I have a small database box sitting in company A with a small amount of storage (i.e. 10GB).

In my company B, I have virtually unlimited storage and computing power, but the link between company B and company A has only about 99% uptime (and company A's internet connection is slow).

I was wondering if anyone has ever heard of a database architecture that lets company A host its data locally until the database becomes too large, and then offloads rarely used rows to company B. Preferably, at that point, most common queries for recent data wouldn't need to hit company B.

It would also be nice if company B retained a non-ACID backup of company A, so that if company A has a hard drive failure, at most a few hours of data might be lost.

What is this architecture called? Are there any open source solutions that support it?
======
ciprian_craciun
This architecture (i.e. application and databases hosted by different
companies) is quite common in cloud-based setups where, for example, one uses
AWS RDS for the databases and Heroku for running the code.

However, the main issue here is not the "ownership" (i.e. company A vs B) but
the network access (i.e. the network of company A vs B) and the liabilities
(who has to manage what).

In the previous example, although the DB runs directly on top of AWS (company
A), and the code runs on top of Heroku (company B), which in turn runs on top
of AWS (again company A), most likely all these VMs / servers are in the same
data center, and queries don't pass over the public internet.

So in the end, if you do want to implement that setup, you have to be
extremely careful at the network level, and perhaps employ VPNs and
firewalls.

~~~
aetherspawn
Hey, thanks a bunch for the input.

In my proposed scenario, I would have expected there to be two databases - one
at company A and another at AWS (company B).

I plan to give company A a preinstalled server and ask them to just plug it in
at their premises. But the server won’t be big enough after 6 months of data,
so I want to charge them a small rate to push their data to my AWS when it
gets too full.

The technology to sync the two databases and determine which data to keep on
premises and which to push off to AWS is what I am trying to identify.

Hopefully that clarifies things.

~~~
ciprian_craciun
Unfortunately there is no "out-of-the-box" solution for what you are trying to
do.

It depends on the actual database (or storage solution) you employ; for
example:

* in the case of Postgres there is streaming replication; but I think by default (and perhaps there isn't a way to disable this) it also "streams" `DELETE` operations, so you would still have to handle the archiving manually;

* in the case of CouchDB (a document store) you could use replication, and then on the "on-premise" server (i.e. in company A) delete the old data and filter those deletions out of the replication (see the sketch after this list);
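
For the CouchDB option, here is a minimal sketch of that filtered replication, in Python, using CouchDB's standard HTTP API; the database name, URLs and credentials are hypothetical:

    # Continuous A -> B replication that drops deletions, so purging
    # old documents on-premise does not remove them from the remote copy.
    import requests

    LOCAL = "http://admin:secret@localhost:5984"         # company A (hypothetical)
    REMOTE = "https://admin:secret@db.example.com:6984"  # company B / AWS (hypothetical)

    # 1. Install a design document with a filter that rejects tombstones.
    #    (CouchDB filter functions are JavaScript, stored as strings;
    #    re-running this PUT returns 409 once the document exists.)
    ddoc = {
        "_id": "_design/sync",
        "filters": {"no_deletes": "function(doc, req) { return !doc._deleted; }"},
    }
    requests.put(LOCAL + "/appdata/_design/sync", json=ddoc).raise_for_status()

    # 2. Start a continuous, filtered replication from A to B.
    requests.post(LOCAL + "/_replicate", json={
        "source": LOCAL + "/appdata",
        "target": REMOTE + "/appdata",
        "filter": "sync/no_deletes",
        "continuous": True,
    }).raise_for_status()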

However, in the end I bet you'll end up implementing a custom replication
procedure. That is not impossible, but it is not quite easy either once you
take network reliability into account.
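
Such a custom procedure can be fairly small if you build it around a watermark and idempotent writes; a rough sketch in Python (the `local_db` / `remote_db` wrappers, table name and helper methods are hypothetical, and real code would need authentication and logging):

    import time

    BATCH_SIZE = 1000

    def sync_forever(local_db, remote_db):
        # Push rows newer than the remote watermark, batch by batch.
        # The watermark is re-read from the remote on every iteration,
        # so the loop is safe to restart from scratch after any failure.
        while True:
            try:
                watermark = remote_db.max_id("events")    # hypothetical helper
                rows = local_db.fetch_after("events", watermark, BATCH_SIZE)
                if not rows:
                    time.sleep(60)                        # nothing new; back off
                    continue
                # Idempotent upsert: re-sending a batch after a dropped
                # connection must not create duplicate rows.
                remote_db.upsert("events", rows)
            except OSError:
                time.sleep(60)                            # link A <-> B down; retry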

----

A "quick and dirty" solution would be this:

* dump the data (in SQL, JSON, TSV, or whatever fits your use-case) in batches of, say, 1 day or 1 week each; compress those with `zstd` or `lzip`; (depending on your data it should compress quite nicely, by at least 75%;)

* upload that to S3 for archival, or as a staging area;

* if you need on-line access to the data, import it into your database on the AWS side;
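
Stitched together, the pipeline could look roughly like this in Python (a sketch assuming Postgres with `psql` and `zstd` on the PATH, and `boto3` with AWS credentials configured; the bucket, database and table names are hypothetical):

    # Dump one day of rows, compress it, upload it to S3.
    import subprocess
    import boto3

    BUCKET = "company-a-archive"   # hypothetical bucket

    def archive_day(day: str) -> None:
        dump_path = "/var/backups/events-%s.tsv" % day

        # 1. Dump one day's worth of rows as TSV via psql's \copy.
        query = ("\\copy (SELECT * FROM events WHERE created_at::date = '%s') "
                 "TO '%s'" % (day, dump_path))
        subprocess.run(["psql", "-d", "appdata", "-c", query], check=True)

        # 2. Compress; typical row data shrinks by 75% or more.
        #    (--rm deletes the uncompressed dump on success.)
        subprocess.run(["zstd", "-q", "--rm", dump_path], check=True)

        # 3. Upload to S3 for archival, or as a staging area.
        boto3.client("s3").upload_file(dump_path + ".zst", BUCKET,
                                       "events/%s.tsv.zst" % day)

    archive_day("2024-01-15")      # example: archive one specific day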

