
Scaling Bitbucket’s Database - WalterSobchak
https://bitbucket.org/blog/scaling-bitbuckets-database
======
adamfeldman
"Bitbucket uses PostgreSQL with one primary read-write database and N read-
only replicas."

------
hinkley
I am architecting a project that is at a smaller scale but bears some
resemblance, and I’ve already settled that I will be very careful about GET vs
POST/PUT semantics and route all traffic of the former only to the replicas
and all of the latter through a connection pool to the primary. That, I
expect, will hold me well into >1% of their traffic.

I’m not sure if I’ve missed something with their LSN work or if this indicates
that the GET semantics horse has already escaped their proverbial barn. Within
the call-response of a single request, which they seem to be talking about,
none of this should be necessary. Right?

However a cascade of narrowly-spaced follow-up requests could easily catch you
in this trick. Lie. Whatever you wish to call it.
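
For illustration, here is a minimal sketch of the routing described above: safe methods go to replicas, everything else goes to the primary. The `MethodRouter` class, the `pin_seconds` window, and the target names are all hypothetical. The pinning window is an assumption added to cover the closely spaced follow-up requests mentioned above; it is not something the article describes.

```python
import itertools

# HTTP methods that should never modify state.
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}


class MethodRouter:
    """Choose a database target from the HTTP method (hypothetical helper)."""

    def __init__(self, primary, replicas, pin_seconds=0.0):
        self.primary = primary
        # Round-robin over replicas; None means "no replicas configured".
        self._replicas = itertools.cycle(replicas) if replicas else None
        self.pin_seconds = pin_seconds
        self._last_write = {}  # user_id -> timestamp of that user's last write

    def target(self, method, user_id, now):
        if method.upper() not in SAFE_METHODS:
            # Writes always go to the primary; remember when this user wrote.
            self._last_write[user_id] = now
            return self.primary
        last = self._last_write.get(user_id)
        if last is not None and now - last < self.pin_seconds:
            # Assumption: shortly after a write, keep the user's reads on the
            # primary so they cannot observe a lagging replica.
            return self.primary
        if self._replicas is None:
            return self.primary
        return next(self._replicas)
```

A time-based pin is cruder than tracking LSNs: it only works if replica lag stays below `pin_seconds`.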

------
cryptonector
Bitbucket is a perfect case for sharding. You could have one DB of users, and
N DBs of user data, with the first referencing the latter, and then distribute
user data among the latter. You could do this straight up with PostgreSQL
without any special server-side software. You could also have a PG server
using FDW to act as a proxy for all the DBs a client needs to be talking to.
There are many other options too.
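
As a rough sketch of the "one DB of users, N DBs of user data" layout: a small directory maps each user to a shard, and the application connects to that shard for the user's data. The `ShardDirectory` class, the in-memory dict standing in for the users database, and the least-loaded placement rule are all assumptions made for illustration.

```python
class ShardDirectory:
    """Maps users to one of N data shards (hypothetical sketch)."""

    def __init__(self, shard_dsns):
        self.shard_dsns = list(shard_dsns)
        # Stands in for the central "users" database that references shards.
        self._user_to_shard = {}

    def assign(self, user_id):
        """Return the user's shard index, placing new users on the emptiest shard."""
        if user_id not in self._user_to_shard:
            counts = [0] * len(self.shard_dsns)
            for shard in self._user_to_shard.values():
                counts[shard] += 1
            self._user_to_shard[user_id] = counts.index(min(counts))
        return self._user_to_shard[user_id]

    def dsn_for(self, user_id):
        """Connection string of the shard that holds this user's data."""
        return self.shard_dsns[self.assign(user_id)]
```

Storing the mapping explicitly, rather than hashing the user id, makes it possible to move a single heavy user to another shard later.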

------
zzzeek
Bitbucket (mostly via Atlassian) worked very hard to chase everyone off their
platform in the vain hopes that the OSS community would start using Jira.
Clearly they bet wrong, but unfortunately they aren't going to get any users
back for a very long time.

~~~
alexis_fr
Free private repos is a real upside.

Central login, where you can never understand which login you are using, what
information they are collecting, which website you are on, or whether
you’ve suddenly integrated Confluence into a Bitbucket instance, is a real
downside. God, I hate SSO’ed companies; they went the way of Google with
YouTube.

~~~
detaro
Free private repos are an upside over ... Github, which gives free accounts
unlimited private repos? Gitlab, which does the same?

~~~
Cyph0n
To be fair, Github only added that recently.

------
caseyf7
Does storing the LSNs in Redis and Elasticache create the same race condition
they are trying to avoid or am I missing something?

------
nihil75
Yea I'm not taking advice from a service that's down once a week

~~~
izacus
Did you actually attempt to read the article or did you just come here to spew
out needless negativity?

~~~
chairmanwow1
Have you ever actually used Bitbucket in production? It has absurd latencies
for every operation from website to even a git push taking full seconds to
respond.

It’s hard to take them seriously when that’s the tableau upon which they are
advising us to build systems.

~~~
skunkworker
A number of years ago I had all of my repos on Bitbucket, but recently I’ve
pulled everything, as their peering can impact large git pulls, which were
unreliable. The transition with the account changeover was enough to make me
jump ship to GitHub completely.

------
rosstafarian0
Is it just me or does looking at write logs to determine read consistency just
seem like a really bad way to do this vs a different load balancing strategy?

~~~
hilbertseries
Definitely feels like a band-aid over a more robust solution, like sharding
their main database by repository or splitting off some of their tables to
another database. Their load of "millions of requests per hour" doesn't sound
particularly high either.

At some point their primary is going to get overloaded again by the writes,
and they'll have added all of this machinery for nothing. They've also made
the replicas an essential part of their stack, whereas their previous stack
could probably tolerate a certain number of replicas going down or some
replica lag. They now have a system that will grow and be dependent on all the
replicas being available, until a heavy write load or heavy table alter
causes the replicas to lag, at which point a higher percentage of
traffic will go to the master and potentially cause downtime.

~~~
pm90
Yikes. Excellent point. Is sharding the only option here? It seems like having
immensely sharded Postgres is one way of solving the problem, but is that the
only option?

Wondering if e.g. CockroachDB would be a good fit.

------
ykevinator
It's amazing that Atlassian still exists.

------
lykr0n
Curious that they use LSN tracking vs making the replication synchronous. I
would be curious to see the performance numbers and the reasoning behind the
choice.

~~~
beoberha
Synchronous replication really slows down write performance and adds in the
complexity of needing to determine if a replica is down. I think very few
people use it in practice.

~~~
Xorlev
+1, sync replication also reduces availability: any replica downtime becomes
write unavailability until you decide that replica is "dead" and stop
replicating to it.

There's also hybrid setups of sync+async for durability reasons, sync to a
single replica so that if your primary db suddenly catches fire you have a
replica that's as up to date as possible, then read replicas using async
replication.

------
saga92
Oracle offers DML redirection for Active Data Guard setups, where a write
operation on the replica database is redirected to the primary database. This
allows applications that make infrequent writes to run on the Active Data
Guard replica database. Also, the write operation completes when the
replica has seen the write from the primary, thereby eliminating the race
condition and avoiding such complex reengineering.

~~~
rosstafarian0
A lot of databases have a concept of consistent reads. Even DynamoDB will
allow you to specify consistent reads explicitly on a query, and writing
through DAX will take care of the scenario you described, as the read cache
is immediately updated. This is a common scenario.

------
loic-sharma
That's really neat! I wonder, could you store the PostgreSQL log sequence
numbers in a cookie to save the round-trip to Redis?
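
A minimal sketch of what that could look like, assuming the LSN is stored as its text form (such as `0/16B3748`) and signed with an HMAC so the server can detect tampering. The function names and the `SECRET` value are hypothetical, and nothing in the article suggests Bitbucket does this.

```python
import hashlib
import hmac

SECRET = b"change-me"  # hypothetical server-side signing key


def encode_lsn_cookie(lsn, secret=SECRET):
    """Return 'lsn.signature' suitable for use as a cookie value."""
    sig = hmac.new(secret, lsn.encode(), hashlib.sha256).hexdigest()
    return f"{lsn}.{sig}"


def decode_lsn_cookie(value, secret=SECRET):
    """Return the LSN if the signature checks out, otherwise None."""
    lsn, sep, sig = value.rpartition(".")
    if not sep:
        return None  # no signature part at all
    expected = hmac.new(secret, lsn.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison on bytes to avoid leaking signature prefixes.
    if hmac.compare_digest(sig.encode(), expected.encode()):
        return lsn
    return None
```

When the cookie is missing or invalid, the safe fallback would be to read from the primary.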

~~~
nhoughto
I thought this too, but that then creates a likely unreasonable burden on the
client. Not all user agents expect cookies either; think of all the varieties
of REST clients. Sounds like they would also need to reference the LSN
internally, so being able to look it up internally would be valuable.

Feels like there is something there tho in terms of approach, esp since they
are keying it on user id.

------
halestock
I found it interesting that their cloud version of bitbucket uses Django while
their self-hosted stack is almost entirely Java.

~~~
mithun
That's because they bought Bitbucket. The self hosted version was internally
developed and used to be called Stash, before they rebranded.

------
sedatk
Step 1: Remove Mercurial support.

------
ngrilly
> we find a replica that is as up to date as the user's saved LSN

The article doesn't explain how they know the current LSN of each replica?

~~~
avremel
IIUC they need to query replicas until they find one that is up to date.

"We have to query Elasticache and replicas, which adds latency."
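
One way the comparison could work, as a sketch: PostgreSQL prints an LSN as two hexadecimal numbers separated by a slash (the high and low 32 bits of a 64-bit position), and a replica can report its replay position via `pg_last_wal_replay_lsn()` (PostgreSQL 10+). The helper names and the dict of replica positions are hypothetical; the article does not say how the replica LSNs are actually obtained.

```python
def parse_lsn(text):
    """Convert a PostgreSQL LSN such as '0/16B3748' into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def pick_target(user_lsn, replica_lsns, primary="primary"):
    """Return the first replica that has replayed up to the user's saved LSN.

    replica_lsns maps replica name -> last replayed LSN (as text), e.g. the
    result of running SELECT pg_last_wal_replay_lsn() on each replica.
    Falls back to the primary when every replica is behind.
    """
    needed = parse_lsn(user_lsn)
    for name, lsn in replica_lsns.items():
        if parse_lsn(lsn) >= needed:
            return name
    return primary
```

Comparing the parsed integers matters: comparing the text forms directly would mis-order values such as `0/FFFFFFFF` and `1/0`.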

------
qeternity
10ms to do a redis get? What am I missing here...

~~~
Xorlev
They said that was their time budget, but didn't follow up with how expensive
it ended up being in practice -- but agreed.

~~~
qeternity
The article says they were able to keep it close to their max budget, which
actually suggests that it was higher than 10ms in practice

------
doubleunplussed
Still salty about them dropping mercurial support without providing any
migration tools, and planning to just delete data without any archiving.

~~~
meritt
I'm also salty. What's your plan? I'm either gonna migrate to git (and use a
different host), or self-host something like heptapod.net

~~~
bjoli
I went to sourcehut. I am very happy, but during the process I ended up
actually having to learn a bit of git. Future projects will probably be git.

