
Ask HN: Which distributed database solutions would you suggest? - iKSv2
Hi guys,

One of my projects, started in 2013, has grown beyond the point of self-healing. It was a cool setup: active-active sync on the webapp (behind HAProxy) and one high-spec'd DB server (both web servers connect to this DB server). Current DB size is ~145GB (based on backups; actual size might be more) and it's PostgreSQL (upgraded to 10).

Now the problem is that the database has grown beyond the point of ease: the constant need to restart services and stop long-executing queries (a simple search easily takes upwards of 20s). Higher-ups have asked me to look for distributed solutions (I am strongly against NoSQL, while management is strongly against paid solutions, though they might give in after I tinker with some sort of trials / open source versions). I am looking for guidance from you guys: are you using any distributed database in production (with a PHP library available for connecting and so on, obviously)?

I have tried installing Cassandra (after which I am strongly against NoSQL) and CitusDB, which has some very serious limitations in the community edition (their open source offering) and also in general. Some more searching turns up Postgres BDR, CockroachDB, and TimescaleDB (but that doesn't partition on non-time columns).

My requirements are:

* A distributed database solution that is horizontally scalable.

* Easy to add and remove nodes without any downtime (sure, I can accommodate some write-locks during setup).

* The ability to tweak the replication factor. Ideally I would love to have replication = number of nodes, i.e. each node holds the complete database, so that simple queries don't have to be distributed (which makes the system slow), while more complex queries can still be distributed since the data is available on every node.

* Best case: some GIS-based plugin / extension on top of the solution would be icing on the cake.

* SQL compatible, so that the least amount of application rewrite is required.

Please help ..
======
zzzcpan
There is no in-place replacement.

Most likely your app relies on database performance for a lot of queries that
can only be fast on a local database. That makes a serious rewrite necessary
if you want to go with a distributed solution, and it requires knowledge and
experience in distributed systems, which takes a lot of time. That's why
choosing a distributed database from the beginning, instead of
MySQL/PostgreSQL, is so important.

Judging by your description, your best bet is probably pre-sharding. It's not
a proper distributed solution, but it doesn't sound like you need one. It
works like this: you partition your data into N shards, where N is two orders
of magnitude higher than the number of servers you plan to have, then split
each table into N tables belonging to each shard and use a static mapping of
shards to servers in the application. Queries won't be able to cross shards.
Some work will be necessary on the app, and on the infrastructure tooling to
manage servers, shards, replication, and monitoring, as you will need to be
able to move shards between servers, restart them, back them up, and so on.
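The static mapping described above can be sketched roughly like this (the shard count, server names, and helper functions are all illustrative, not from the thread):

```python
import hashlib

N_SHARDS = 256                          # ~two orders of magnitude above the planned server count
SERVERS = ["db1", "db2", "db3", "db4"]  # hypothetical server names

def shard_for(key: str) -> int:
    """Stable hash of the sharding key (e.g. a customer id) to a shard number."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % N_SHARDS

def server_for(shard: int) -> str:
    """Static mapping of contiguous shard ranges to servers; rebalancing means
    editing this mapping and moving whole shards, never rehashing rows."""
    return SERVERS[shard * len(SERVERS) // N_SHARDS]

def table_for(base: str, key: str) -> str:
    """Each logical table is split into N physical tables, e.g. orders_017."""
    return f"{base}_{shard_for(key):03d}"
```

A query for one key then always hits a single shard table on a single server; cross-shard queries are simply not possible, which is the trade-off mentioned above.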

~~~
iKSv2
Suppose we go ahead with keeping the system as is and rewrite it in parallel
(essentially starting over): which distributed database would best fit our
use case? (Fast, distributed, ACID, transactional, no single point of
failure, and with some GIS functionality.)

------
AznHisoka
You need more information on _why_ you need to constantly restart services,
and why there are long-running queries.

The latter is actually one of the easiest problems to find a solution for. If
your problem was too many concurrent transactions/high load, then that would
involve distributed solutions. But slow queries? That is solvable with less
work.

~~~
iKSv2
The queries are slow because the resulting dataset is high-volume and we then
act on that data. While this can be remedied (it's being worked on), I saw
some demos on the web of distributed systems and think sooner or later I'll
need to go that way. It makes sense to do it now.

------
eb0la
Try old school tricks first:

\- EXPLAIN your queries: you will be surprised how much of the stuff
generated by ORMs is not well indexed.

\- Take a look at the slow query log (and run those queries through EXPLAIN
first).

\- Exploit recency: I bet most reads are of recent data, or of data that was
written near other data you ask for. If that is the case, consider
partitioning the tables.

\- Think about the best index type for your tables. Some are better than
others.

\- If some data is never going to be written again, a columnar data store
would help.
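To illustrate the first two bullets: the planner output makes missing indexes obvious. Here's a minimal, self-contained sketch using SQLite's `EXPLAIN QUERY PLAN` from the Python stdlib (in Postgres you'd use `EXPLAIN` / `EXPLAIN ANALYZE` in psql, with richer output; the table and index here are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")

def plan(sql: str) -> str:
    """Return the query plan as a single string."""
    rows = con.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[-1] for row in rows)  # last column is the plan detail

query = "SELECT * FROM orders WHERE customer_id = 7"

before = plan(query)  # without an index: a full table scan ("SCAN ...")
con.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan(query)   # with the index: "SEARCH ... USING INDEX idx_orders_customer"
```

The same before/after check against the queries from your slow query log is usually the fastest way to find the ORM-generated offenders.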

Don't go distributed at first.

Distributed means that when something fails you have to know what _and
where_ it is failing, instead of knowing just what.

If you want to go distributed, try sharding right now. If it works for you,
add a cache and use the shards.

But try the old school, boring stuff first, because you might need it later.

~~~
iKSv2
Will check what my team and I can do along the old school lines. Thanks

------
bufferoverflow
First make sure you can't solve your problem by moving to a more powerful
server. 145GB is tiny. 256GB of DDR3 is less than $600 on eBay, and can fit
your whole DB.

Also, are you on a SSD?

~~~
iKSv2
I certainly could. Yes, I am on SSD. But moving to a bigger server is
essentially just delaying the problem I am going to have to solve sometime in
the future. Also, right now I can afford a window of bad (slow) performance
while I migrate.

~~~
JimmyAustin
You'd be surprised how far you can push the problem into the future. OVH will
sell you a server with 512GB of RAM for under a K a month. You can easily go
into the terabytes.

~~~
bufferoverflow
Hetzner PX92 with 256GB DDR4 is just $219/month.

------
hiram112
I'd actually be really interested to hear more opinions, especially with
regard to scaling Postgres.

It seems like a few companies have actively been working on the horizontal
scaling part of Postgres, along with the RDS version.

Would this be something where he could move his data to AWS RDS, change no
code, and have everything 'just work' due to AWS dealing with the scaling of
Postgres?

That seems to be their selling point, though I wonder if anyone has actually
used it with this much data.

I'm really hesitant to use a NoSQL solution, too, for any project, just
because most don't have transactions, don't have _real_ SQL APIs, etc.

------
segmondy
Redesign and fix your database. Nine times out of ten it's a poorly designed
DB and queries. 145GB means nothing. How many rows? Can you partition? Can
you use smaller datatypes? Are you read-heavy or write-heavy? Are your
indexes designed accordingly? Have you split your DB into OLTP and OLAP DBs?
Do you have lots of triggers? Are you having the database do lots of work
that you could offload to the application? Did you actually tune your DB
parameters? Is the hardware scaled up to the max vertically? 145GB is
nothing.

------
borplk
Stick to Postgres and investigate how you can improve things.

I guarantee you have not reached "max postgres capability", far from it.

You will 100% regret changing the database as a solution to your problem and
you will buy yourself lots of other problems.

------
Cypher
bitcoin

~~~
iKSv2
you mean some kind of blockchain type solution ?

~~~
is_true
With a pinch of AI

~~~
iKSv2
any documents / example?

~~~
is_true
I was just adding buzzwords.

To be honest, if you haven't tried to optimize your hw/code/schema/indexes
yet, I would start with hardware, then schema and queries.

I solved a problem with a database for a finance product, but the problem
wasn't the database: it was abuse of the ORM that made the resource usage go
nuts. It made sense at the time of writing the software because the deadlines
were tight, but it should've been rewritten before the system started
failing.

