
Ask HN: Strategies for database reliability when I have no control over the DB? - grepthisab
Working in a corporate enterprise type environment. Very crufty. Unfortunately my application environment, which is very modern, relies on an old, old corporate DB. I have spoken with the DB admins and they are not willing to help make things more reliable; they're pretty siloed off from everything else and don't answer to anyone who could push them to help. The team has no SLOs, and the best uptime I can see is about 90%. In the end this may be a policy question, but for the short term I need to manage the reliability of my applications so I can hit my SLOs.

So from my perspective, how do I manage this? Back up the DBs in my environment as best as possible and serve traffic from there, or cache everything? Data is constantly being added to the old DB, which holds user data. I was thinking about creating a bunch of DB shards in my environment, with LBs in front of them and long-lived connections to the old DB, and if a user makes a request while the old DB is down, serving from the backup shard. Maybe sharding is too complicated for this scenario. Advice?
======
user5994461
Assuming you are in a very large organization.

1) When the system is down and you're asked for an explanation, you deliver a
postmortem putting the responsibility on the database: the database is down,
the database team is working on it.

2) If you are facing repeated major downtime, you keep track of failures
caused by the database. Make a chart from a health check that hits the
database every minute.

3) You report to your management that the database service doesn't meet
expectations and is affecting your services, supported by the reports you
just created. It's the responsibility of the database team and its management
to run their service and decide where to allocate resources.

4) Make it clear that your service depends on its database. Any question about
downtime or a better SLA should be answered with the postmortem from 1 and/or
the improvement plan from 3.

All you can do is report. You probably don't have a say. Your manager probably
doesn't have a say. Maybe the database team doesn't have the power to fix
anything either.

~~~
beckler
Coming from the small business/startup world to a giant enterprise, this is
what I struggle with the most. I'm used to digging in the dirt to figure out
an issue and getting it fixed as quickly as possible.

Having zero control over the infrastructure and having to work with various
teams can be a pain whenever you want to make a change, but it has a silver
lining: production issues are usually not your responsibility.

But with their response times, you would think it's not their responsibility
either...

------
scarmig
1) SLO improvement should be a negotiation that includes management. It's a
resourcing decision: what does it enable, and at what cost? Though, TBH, it
sounds like even if they were mandated to achieve a certain SLO, you couldn't
trust them to achieve it or to measure correctly whether they're achieving
it. If it did get to that point, you would want to be closely involved in
defining how the SLO is measured.

1b) Where are your SLOs even coming from? If the db team offers no SLO, you
have no business offering one yourself; or if you do, it needs to take
observed backend reliability into account, not be something imposed by a
pointy-haired boss.

2) I'm skeptical about the sharding solution. If the data you're using doesn't
need to be shared across teams, consider just setting up your own db.

3) Aggressive caching and prefetching might improve the end user experience.

4) Graceful degradation. Build your UIs such that if one call to the DB fails,
it only breaks a minimal part of the page, not everything.
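
A minimal sketch of point 4 with hypothetical fetch helpers (nothing here is
from the post): each panel of the page fetches independently, so one failed
DB call degrades only its own section.

    # Each panel catches its own DB failure and renders a fallback, so the
    # rest of the page still works.
    class DatabaseError(Exception):
        pass

    def render_panel(fetch, fallback_html):
        try:
            return "<div>%s</div>" % fetch()
        except DatabaseError:
            return fallback_html                # degrade only this panel

    def fetch_profile():
        return "profile data"                   # stand-in for a DB call

    def fetch_orders():
        raise DatabaseError("old DB is down")   # simulate the flaky dependency

    page = (render_panel(fetch_profile, "<div>profile unavailable</div>")
            + render_panel(fetch_orders, "<div>orders unavailable</div>"))
    print(page)  # profile renders normally; orders shows its fallback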

This sounds more like an organizational issue, though. The most value you
could provide to the company sounds like it'd be breaking through those
institutional barriers. More likely than not you'd try and fail (nature of
these things, not a knock on you), but it's worth a shot.

------
jabart
Been there. We had a central DB that everyone relied on, which also allowed
ad-hoc queries from support staff. Locks were an issue, and a failed batch job
would leave transactions open. The entire org had to use this database, no way
around it, and there were 10+ years of legacy code in various systems.

We took the simple approach: we had our own database server, so we had the
DBAs set up a one-way replication of the key tables we needed. We put stored
procs and synonyms in front of those tables so that if they moved it wasn't a
major expense for us; that also helped the DBAs feel better in case they
needed to drop the replication. We also copied over any stored proc we needed
to use and pointed it at our synonyms. This also helped buffer any unintended
consequences from an update to a stored proc. Average replication latency was
250ms, and we set up an alert to let us know if it went higher, so we could
tell the other teams they might start having issues.
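
A hedged sketch of the lag alert (the post doesn't say how lag was measured;
a heartbeat table that the source updates every second is one common
approach, and the table and notify() helper here are hypothetical):

    # Runs on a schedule against the replica: read the newest heartbeat the
    # replica has received and alert when the copy is older than the threshold.
    import time

    THRESHOLD_MS = 250  # the observed average latency from above

    def check_replica_staleness(replica_conn, notify):
        cur = replica_conn.cursor()
        cur.execute("SELECT ts_ms FROM heartbeat WHERE id = 1")
        (last_seen_ms,) = cur.fetchone()
        lag_ms = time.time() * 1000 - last_seen_ms
        if lag_ms > THRESHOLD_MS:
            notify("replication lag %.0fms; downstream reads may be stale"
                   % lag_ms)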

The result: the entire company could be frozen by an open transaction on a
key table, but our app could keep working. Effort level was low, results were
amazing.

~~~
3pt14159
That sounds really interesting!

One question though: did your app not need to propagate changes back to
theirs? Like, say a customer moved from Canada to the USA and you no longer
needed to apply HST. How did you handle data going back the other way? Aren't
you risking getting out of sync if the data is a two-way propagation?

~~~
jabart
When we did need to propagate changes back, we used a service owned by
another team. If that team used queues, we just did a direct post and logged
any errors. If they didn't, we put up a queue of our own, added it to our
queue processor, and retried their service for 3 days, logging each attempt,
before eventually dead-lettering the message so our ops team could retry it.
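
A hedged sketch of that retry loop (the queue interface, service call, and
logger are hypothetical stand-ins, not anything from the actual system):

    # Try the other team's service for each queued message; log every failed
    # attempt; after three days, dead-letter the message for ops to retry.
    import time

    THREE_DAYS = 3 * 24 * 3600

    def process(queue, dead_letters, post_to_their_service, log):
        for msg in queue.pending():
            try:
                post_to_their_service(msg.body)
                queue.ack(msg)                  # delivered; drop it
            except Exception as exc:
                log("propagation failed for %s: %s" % (msg.id, exc))
                if time.time() - msg.first_seen > THREE_DAYS:
                    dead_letters.put(msg)       # ops can retry by hand
                    queue.ack(msg)
                else:
                    queue.retry_later(msg)      # back on the queue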

------
mooreds
As others have pointed out, this is really two problems. The first is an
organizational one, and is more important to solve. Why can't you get better
uptime? Why can't the corporate db be upgraded? I think you need to explore
those issues first and figure out if there's a way forward to solve the root
issue. This should be a discussion including the dbas and whoever your SLO is
to, and probably your boss. Address these issues with minimal blame and focus
on what success looks like for the organization. Maybe the old db has been
unsuccessfully upgraded in the past and the dbas need more resources. Maybe
your app isn't as important as keeping the old db running. Who knows?

If you can't reach a political solution, then you can add a band-aid with a
technical solution. Just realize that this is only a patch and that in a few
years someone (perhaps you) will be cursing the creator for not fixing the
root issue.

You didn't specify if your app only reads from the old db, or if it writes to
it as well. Either way you'll need to stand up some intermediary. What kind of
intermediary really depends on throughput and data size, so we'd need more
information to give useful advice.

------
mratzloff
The central question to answer is whether your boss is held responsible for
failure to meet the SLA or whether you alone are.

Assuming your boss is held responsible, I would design the system to treat the
database like an unreliable API. Wrap up all operations in a DAL and interact
with a local database. Do regular transactional sync operations in the
background. In case of data conflict, determine who wins and why in each case.
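
A rough sketch of that shape, with hypothetical store interfaces and an
assumed conflict rule (corporate wins unless a local edit hasn't synced yet):

    # The app talks only to the local DB; a background sync pushes dirty rows
    # upstream and pulls remote changes whenever the corporate DB is up.
    class DataAccessLayer:
        def __init__(self, local_db, corporate_db):
            self.local = local_db          # always available
            self.corporate = corporate_db  # unreliable upstream

        def save_user(self, user):
            user["dirty"] = True           # mark for the next sync pass
            self.local.put("users", user)

        def sync(self):
            try:
                for user in self.local.dirty_rows("users"):
                    self.corporate.upsert("users", user)  # push local writes
                    self.local.mark_clean("users", user)
                for user in self.corporate.changed_since(self.local.last_sync):
                    if not self.local.is_dirty("users", user["id"]):
                        self.local.put("users", user)     # pull remote changes
            except ConnectionError:
                pass                       # corporate DB down; try next cycle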

Don't spend too much time on the design. Just enough to get the point across.

Bring the design to your boss and say, "This is what I'm thinking, given that
we have an SLA but this critical dependency does not, and it has an observed
uptime of 90%." Then let your boss decide what to do.

If you are the only one who will be held responsible and your boss doesn't
support your design or go to bat for you, you work in a bad environment and
should find a new job.

------
thaumaturgy
I've recently had to deal with a sort of similar environment, though for
different reasons.

I ended up setting up a persistent ssh tunnel to the production server with
autossh, and then writing a miniature daemon that would regularly poll tables
in the production server, figure out recent changes, and write the changes to
a log. A secondary process would read the log and copy the changes into the
development environment.

By "regularly", I mean that it ran every minute.

autossh was used to keep the resident sysadmin from complaining about
thousands of ssh logins. It also made the first daemon a little easier to
write, since from its perspective, it was still connecting to a local
database.
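
The polling daemon can stay small. A minimal sketch, assuming each table
carries an updated_at column (the post doesn't say how recent changes were
detected) and a DB-API connection pointed at the tunnel's local endpoint:

    # Every minute: find rows changed since the last pass and append them to
    # a change log for the second process to replay into dev.
    import json
    import time

    TABLES = ["users", "orders"]  # hypothetical table list

    def poll_once(conn, since, log_file):
        latest = since
        cur = conn.cursor()
        for table in TABLES:
            cur.execute("SELECT *, updated_at FROM {} "
                        "WHERE updated_at > %s".format(table), (since,))
            for row in cur.fetchall():
                log_file.write(json.dumps({"table": table, "row": row},
                                          default=str) + "\n")
                latest = max(latest, row[-1])  # advance the high-water mark
        log_file.flush()
        return latest

    def run(conn, log_file, since):
        while True:
            since = poll_once(conn, since, log_file)
            time.sleep(60)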

This worked surprisingly well. It wasn't fiddly or glitchy at all. Anytime
there was a connection problem, things would stall for at most a few minutes
until autossh could re-establish the connection, and then the reading daemon
would happily work its way backward through the tables.

Essentially this is just building out a master -> slave database relationship,
but with application logic instead of reconfiguring the DB environment. It
didn't require a very large time investment to get it working.

------
lsb
You can do this! It is a little work.

It sounds like you are able to deploy your own infra, and it sounds like you
are not making any writes to the db.

These two factors make the job significantly easier, versus needing to hold
writes (and deal with reads against that augmented database) until the backing
db comes back online. (CRDTs are a nice way of expressing a database that
grows over time, but an old, old corporate DB sounds like some mess in SQL.)

The main principle is that you're going to serve queries from your own system,
and your system is informed by $OLD_DB. You're going to have, ultimately, a
`last_updated_at` attribute on everything you know.

You're going to, ultimately, stream all of $OLD_DB into $NEW_DB. If a DB query
to $OLD_DB is made over HTTP, the query bodies are a bit larger and slightly
ungainly, but very easy to cache: stick a caching proxy in front of that HTTP
endpoint, tell it to keep EVERYTHING and serve stale content, and watch the
headers that come back. Now you have database queries that can be as stale as
necessary.
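
An in-process toy version of that stale-serving behavior (a real caching
proxy is the better tool; the error handling here is illustrative):

    # Cache every successful response; if the upstream HTTP endpoint is down,
    # serve the last good copy, however stale it is.
    import urllib.request

    class StaleOkCache:
        def __init__(self):
            self.store = {}                      # url -> last good body

        def get(self, url):
            try:
                with urllib.request.urlopen(url, timeout=5) as resp:
                    body = resp.read()
                self.store[url] = body           # refresh on success
                return body
            except OSError:                      # upstream down: go stale
                if url in self.store:
                    return self.store[url]
                raise                            # never had it; can't serve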

------
chrisgoman
Seems like you are doing mostly reads. Implement a very basic ETL (just a dumb
copy) to a DB you control. Once the corporate DB is better, hopefully you are
just changing the connection string. If you have some control, use a
replicated slave instead. For writes, you may have to try to connect and
write, then catch any exceptions and queue up the write for later. Once things
get stable, your queue should be empty.
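
A bare-bones sketch of that write path (the in-memory list stands in for a
real durable queue, and upsert() is a hypothetical helper):

    # Try the corporate DB first; on failure, park the write for later replay.
    pending = []

    def write_row(corporate_db, table, row):
        try:
            corporate_db.upsert(table, row)
        except ConnectionError:
            pending.append((table, row))     # replay once things stabilize

    def drain(corporate_db):
        while pending:
            table, row = pending[0]
            corporate_db.upsert(table, row)  # raises again if still down
            pending.pop(0)                   # only drop after success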

------
wbl
This is a management problem. You need to say "we cannot meet our SLOs and
depend on a database that doesn't have SLOs of its own".

------
ecesena
I'd try the sharding route. Check out Vitess
([https://vitess.io](https://vitess.io)); hopefully it works in your case,
otherwise you can look for similar alternatives. It's a proxy that 1) makes
sharding transparent, and 2) protects the db from bad queries. If you set this
up for your app, perhaps later you can convince other apps to switch to it,
improving the quality of the db overall.

A few other comments:

- Do you have a read replica where you can do read-only queries? If not,
setting one up is also a good way to alleviate the burden on the primary db.

- Would a cache make sense? Do you often read the same data and do joins every
time? If so, I'd also consider a caching layer.

------
jlgaddis
A lot of the comments here are missing the point.

This is _NOT_ a technical problem to be solved with a technical solution. It
is a management/organizational issue and should be solved at that level.

~~~
bigcostooge
But it won’t.

~~~
lostcolony
This. Everyone gets that it -should- be dealt with by management. Everyone
(else) also gets that it won't be, because that's the reality of many large
companies.

------
scosman
If you can't get the team who owns the data to meet the needed uptime, then
you probably shouldn't start the project in the first place. It's a classic
case of organizational differences in priority; solving them with tech pretty
much never works. Any caching layer will be more expensive than fixing the
root cause, and add consistency issues. As an IC, you'll feel pressure to make
it work the best you can with what you have, but sometimes the answer is it's
doomed from the start if the right folks aren't bought in.

Some tips if you have to proceed:

1) Be clear you can promise an SLO for your app, but not its dependencies. If
the dependencies are down, some features won't work.

2) Have honest, clear errors that explain the root cause. Cloudflare's
"Web server is down" page is a great example. It pretty clearly says "we're
up, but they aren't".

------
protomyth
This isn't a technical problem to be solved with shards; besides, you do not
want to have to answer the resulting questions if something goes wrong (e.g.
security).

What specifically did you ask for when you say "they are not willing to help
out to make things more reliable"?

If the corporate DB has 90% uptime, then you need to tell your boss and
internal customers that the upper limit on your own uptime is that 90% figure.
If that is not acceptable, then your manager needs to deal with the DB team's
manager. I am a little confused about how an old DB only rates 90%, since
mainframes and iSeries boxes are way above that. How modern your application
environment is, or how crufty the enterprise DB is, is irrelevant to any talks
between teams.

------
ewjordan
If you need to write to the DB, you're hosed, and caching locally will only
make things more of a headache. If not, you might be able to improve read
uptime, but at a high cost. It's _not_ easy to keep a local DB in sync with a
remote that's only got 90% uptime, especially since it sounds like you have no
support from the DBAs (they'd ideally need to set up replication permissions
for you, and with that much downtime it's going to be a mess).

As others have said, this is an organizational issue and the best way to
handle it is to dig into the human politics.

------
wilhempujar
I agree with many of the responses here and it makes sense to design your
applications around the database limitations... But at the end of the day,
it's a service level management issue.

While it's easy to understand the bias toward looking for solutions that route
around the aforementioned DBAs, it's going to take a pretty significant,
collaborative effort to define policies that work for all stakeholders.

The sooner the better! It's going to be a bit painful but I'd focus on
addressing the root of the issue at hand.

My two cents :)

------
lostcolony
We've had to deal with similar things.

Figure out how much you -really- have to sync with this database. If you can
have an alternative source of truth for your apps, it will help (and you can
then just best-effort push back to the crufty DB).

If you can't make it so you have your own source of truth, determine if an
outdated truth is better than no answer. For many problems this is the case.
In that case, have all reads hit the crufty DB first and, failing that, your
cache. Have a sync operation as well (or at least, every time you read from
the crufty DB, persist any returned values to your cache). For writes you can
write to both as well; however, while the crufty DB is unavailable, you'll
need to decide whether to accept the write into the local DB and put the sync
operation on a queue to retry, or to only put it on a queue to retry, hitting
the crufty DB first and taking it into your own DB only on acknowledgement
from there. (Basically: in the event of a write-then-read while the crufty DB
is down, do you want reads to return the last known synced value, or the last
value written, even if it hasn't made it to the crufty DB?)
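
The read path above is only a few lines. A sketch with hypothetical store
objects:

    # Hit the crufty DB first; every successful read refreshes the cache;
    # on failure, fall back to the possibly-outdated cached value.
    def read(key, crufty_db, cache):
        try:
            value = crufty_db.get(key)
            cache.put(key, value)    # opportunistic sync on every read
            return value
        except ConnectionError:
            return cache.get(key)    # outdated answer beats no answer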

Either way, that can cause some interesting race conditions. Make sure you
have good logging capabilities.

If neither of those is possible, i.e. you absolutely -have- to rely on the
crufty DB as the source of truth and it's better to return no answer than an
outdated one, then you're screwed from a technical perspective. Communicate
the resulting failures fully, be clear with the stakeholders about where the
fault occurred, and make clear that it's not something you control.

------
devonkim
Fundamentally, your database's availability matters more than anything else
for your SLO at the moment. So it is important to understand what keeps the DB
team from delivering a solution with a better measured SLA. If they're
fighting similar problems to yours, such as legacy software, and the business
has made supporting that a higher priority (even at the cost of torpedoing
newer projects), you need to accurately and objectively show how your
deficiencies follow solely from those organizational priorities. If the DB
team is just run poorly, understaffed, etc., it's even more important to be a
politically aware engineer in order to be successful.

Is there even any application or service in your enterprise that can show high
availability while using that database? Be careful not to become the pawn of a
leader who is being thrown under the bus by incumbents, tasked with basically
impossible duties and objectives. Even if that person's intentions are noble,
it simply increases the chances of failure, and your lessons are then lost
(most large organizations and cultures seem to do a poor job of reflecting
carefully on failures, compared to trying to replicate successes no matter how
circumstantial the success was).

------
thedirt0115
Disclaimer: the following suggestion only works if you only READ from their
DB. Can you roll your own DB that matches theirs schema-wise for the tables
you need, but doesn't suck in terms of availability? If you can get that set
up, then in places where there are writes to their DB that you care about,
also write to yours. Do a one-shot migration to copy the data you care about
after you land the double-write change. Then modify your app to read from your
DB and ignore theirs. Finally, once everything has been running smoothly for a
while, collect data showing your system is better than theirs, get them fired,
and become the new DBA for your company :)
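
A sketch of the double-write step (helper names are made up); the backfill is
an upsert so rows that were already double-written are harmless:

    # Every write that used to go only to their DB now also lands in yours.
    def save_record(theirs, yours, table, row):
        theirs.upsert(table, row)     # their DB stays the source of truth
        yours.upsert(table, row)      # build your copy in parallel

    # One-shot migration, run after double-writing is live.
    def backfill(theirs, yours, table):
        for row in theirs.select_all(table):
            yours.upsert(table, row)  # idempotent; duplicates are safe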

------
hullsean
This is an organizational alignment issue. Your application SLOs can be at,
but not above, those of your dependent components. If that DB has 90% uptime,
then your application cannot do better. Add an outage message that makes the
cause loud & clear.

------
mschuster91
> Back up the DBs in my environment as best as possible and serve traffic from
> there

From a data protection/retention view: run a cache and use expiration on the
order of weeks. Keeping the data floating around as "backups" is a recipe for
heavy fines if someone decides to do an audit and finds some leftover SQL
dumps.

If you do need a full backup/replication: get this signed off by your
superiors so you aren't on the hook in case someone decides to do an audit and
determines you to be the one responsible.

------
babygoat
Oh man, I totally relate to your environment and dilemma.

Too complicated or not, if you can get away with it I would do the sharding
project for the fun/experience/learning.

~~~
divbait
(as long as it's not customer critical - at least setting something like this
up would be a fun weekend project :) )

------
mr_toad
Are you writing to the DB?

If not, copy the data. Reporting from a live system can cause all sorts of
problems, including not being able to reproduce results.

------
mindcrash
Capture changes at the old db and stream them to your own database (if
possible).

Here's how to do this with Kafka and Postgres:
[https://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/](https://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/)
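
The consumer side of such a pipeline can be small. A hedged sketch using
kafka-python (the topic name and JSON encoding are assumptions; Bottled Water
itself emits Avro):

    # Read the change stream off Kafka and apply each event to your own DB.
    import json

    from kafka import KafkaConsumer  # pip install kafka-python

    def apply_to_local_db(change):
        print("would apply:", change)        # stand-in for your upsert/delete

    consumer = KafkaConsumer(
        "olddb.public.users",                # hypothetical CDC topic
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
        auto_offset_reset="earliest")        # replay history on first run

    for event in consumer:
        apply_to_local_db(event.value)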

