
Migrating a 40TB SQL Server Database - ksec
https://www.tarynpivots.com/post/migrating-40tb-sql-server-database/
======
polygotdomain
I think some of the posts here miss a little bit of the context as to why
things like this happen in the first place. It's only in the last handful of
years that a stack for logging has really become mainstream. Chances are a lot
of these types of logging solutions predate that and used whatever persistence
technology was readily available. Writing to files on web servers can be a
pain, and these logs will have to be queried at some point, so storing them in a
database is not a bad idea, especially when better options have only recently
become available.

The problem is that relational SQL is bad for logs, but by the time it reaches
the scale where that's a problem, there's too much volume in the logs to make
anything "easy". Simultaneously there's a lot of business value in that
log data that you don't want to lose.

Yes, SQL's a poor fit for logs, but it's a better fit than a lot of other
things, including not logging at all. Better solutions exist, but they don't
exist in a bubble, and there's a cost to integrating them and migrating to
them. A lot of these comments seem to be judging a technology decision based
solely on hindsight without realizing that there are legitimate reasons for
logging to SQL.

~~~
jeffbee
I disagree with the premise of your statement. It's typical that a log will be
accessed zero times. Collecting, aggregating, and indexing logs is usually a
mistake made by people who aren't clear on the use case for the logs.

~~~
jrbancel
Absolutely, the vast majority (95%+) of logs are never read by a human.
Therefore, processing them is enormously wasteful. A good architecture will
write once and not touch anything until it is needed.

I spent years working on a system handling 50+PB/day of logs. No database or
ELK stack can handle that, and even if one could it would be prohibitively
expensive.

~~~
Keyframe
Where did you work? CERN?

~~~
jeffbee
It's adorable when people think scientific computing has the same scale as a
Google or Microsoft.

~~~
FridgeSeal
Sorry what was that?

[https://www.spie.org/news/photonics-focus/mayjun-2020/square...](https://www.spie.org/news/photonics-focus/mayjun-2020/square-kilometer-array-big-data?SSO=1)

~~~
jeffbee
Ignoring the fantasy b.s. in the second half of the article, the stuff at the
top is exactly what I mean.

A mighty 400 GB/s: about 34.5 PB/day, i.e. much less than the >50 PB/day of
logs the other person mentioned;

1600 hours of SD video per second: i.e. about 1-2 million concurrent HD
streams, or much less than the amount actually served by YouTube.

IBM Summit "world's most powerful supercomputer": < 5000 nodes, i.e. much
below the median cell size described in the 2015 Borg paper. Summit is a
respectable computer but it would get lost in a corner of a FAANG datacenter.

~~~
lukeschlather
CERN is a correct example. The LHC reportedly generates 1PB per second:
[https://home.cern/news/news/computing/cern-data-centre-passe...](https://home.cern/news/news/computing/cern-data-centre-passes-200-petabyte-milestone)

~~~
jeffbee
If you define “generates” to mean “discards” then yes.

~~~
welterde
It still gets processed, though; only the non-interesting events get
discarded.

Otherwise the tape alone to store it on would exceed their total operating
budget in a day, so they have to be a bit clever about it.

------
oneplane
This reads like the most Microsoft or most legacy-on-prem problem ever. Why on
earth would you do this, and even then not take the opportunity to do it right
while you are migrating anyway? Are you mad?

Well, that was my first reaction anyway. It seems to me that this is more of a
'when all you have is a hammer, everything looks like a nail' situation; if you
are a vendor shop (i.e. Microsoft-only) and you are a DBA and specifically
just for MSSQL, then this is probably the best way you could come up with,
based on the skills, experience and tools at hand. It does make me wonder if
at any point someone thought to ask for help instead of keeping this going.

~~~
BrentOzar
> Why on earth would you do this, and even then not take the opportunity to do
> it right while you are migrating anyway?

Because as a database administrator, you often have to solve hosting problems
without telling your developers to rewrite all their code.

~~~
pc86
Which speaks to much more dire organizational issues - that nobody can throw
up a red flag and say "hold the eff up, we've reached a scale where this is
broken and we need to address it."

~~~
hn_acc_2
I would say this is actually the norm... In most orgs the leadership is not
highly technical and this is always a hard sell.

The best you can hope for after a "hold the eff up" rewrite is for everything
to keep working the same "but it's more hardened/scalable/modular/blah blah",
and the worst case is screwing up some critical business process while the
kinks are worked out.

Also, there is rarely any incentive for anyone from Joe Developer all the way
up to C level to even call for this in the first place.

------
iliasku
I found this article useful, mostly for people who don't have similar
experience dealing with large databases. It gives so many reasons why
following this path is such a bad idea. Please people, stop writing custom
scripts and then running them untested on the production DB. If the company
doesn't pay for enough disk space to take a proper backup for testing, just
quit or buy it yourself. It will save you 11 months and many hours of sleep.

------
malisper
I've done migrations of several databases with ~20TB of data before. Assuming
your data is immutable, a much easier approach than what they did is to
restore a backup of the database onto a new machine. Since you don't have to
worry about serving production traffic, you can have that machine go full
throttle on the migration. When the migration has finished, you copy over the
new data that came in while it ran, which isn't too bad.

When I've done this with 20TB DBs, it usually took several days to complete.
That's much better than the 11 months mentioned in the blog post.
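
For SQL Server that offline pass starts with a plain restore onto the new
box. A minimal sketch, with the database name, backup share, and file paths
all invented for illustration:

    -- Restore the backup onto the new machine; names and paths are made up.
    RESTORE DATABASE TrafficLogs_New
    FROM DISK = N'\\backupshare\TrafficLogs.bak'
    WITH MOVE N'TrafficLogs'     TO N'D:\Data\TrafficLogs_New.mdf',
         MOVE N'TrafficLogs_log' TO N'L:\Log\TrafficLogs_New.ldf',
         STATS = 5;  -- report progress every 5%

Once restored, the copy can be migrated at full speed with no production
traffic to worry about.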

~~~
d_k_f
I'm late to the party, but she mentioned this in the article. The new server
didn't have enough space to hold both the backup and the newly restored
database, so that approach didn't work.

Quote: "We’re a very lean shop when it comes to hardware, and this is by far
our largest chunk of data we have to store, so there wasn’t a place anywhere
on the network to store a 33-38TB full backup temporarily, while it was being
restored to a server."

As others have mentioned in the thread, a lot of the problems seem to have
been solvable by just buying/attaching more storage, even if only temporarily.

------
rrdharan
It seems concerning that they would be hanging onto traffic logs for that
long. Seems like a massive PII / data breach liability.

~~~
tutfbhuf
Agreed, it's questionable why a company needs to save raw traffic logs for
years.

~~~
bryan_w
You say that, but people were pretty upset that Google couldn't confirm
whether anybody had exploited the G+ security bug beyond the last two weeks
before it was discovered.

~~~
m3047
+1 for them. Pretty cynical, I know. But imagine the request for legal
discovery...

------
posix_me_less
> _" We realized that the original daily table structure wasn’t ideal. If we
> needed to query something over several days or months, it was terrible —
> lots of UNION ALLs or loops to crawl through days or even months at a time
> was slow."_

Seems to me that storing traffic logs in an SQL database is a very bad idea,
especially if you have 40TB of them. Can anybody explain why an SQL database
makes sense for large time-series data? Why not just dump it into flat files?
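
To make the quoted pain concrete, even a three-day query against a
table-per-day design ends up as something like this (a sketch; table and
column names assumed):

    SELECT Url, COUNT_BIG(*) AS Hits
    FROM (
        SELECT Url FROM dbo.Logs_20190101
        UNION ALL
        SELECT Url FROM dbo.Logs_20190102
        UNION ALL
        SELECT Url FROM dbo.Logs_20190103
    ) AS d
    GROUP BY Url;

Now extrapolate that to a multi-month range.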

~~~
BrentOzar
> Anybody can explain why SQL database makes sense for large time series data?

Let's say, just to say, that you have effectively free licensing for Microsoft
SQL Server (because you already own the licenses in question), and you have
staff who know that platform well.

Sometimes it's easier to use the tools you already have rather than go acquire
a new platform to achieve a goal.

As with any persistence layer you choose, you're accumulating technical
debt the instant you make the choice. Over time, if better persistence layers
become available, you have to make a judgment call about when it makes sense
to change the code, versus when it makes sense to do maintenance work on the
persistence layer.

~~~
slg
>and you have staff who know that platform well.

People often ignore how important this is, especially for a small team like
Stack Overflow appears to have. A new platform requires training and/or lost
productivity as employees learn the new system. It requires developer time to
convert over any work designed to use the old system. It makes hiring more
difficult, because you either want someone who knows both of your systems,
which greatly shrinks your pool of potential hires, or new employees will need
to learn a new system on the job, which increases the ramp-up time for any
hire. Maybe this doesn't fit the textbook definition of technical debt, but it
is still a real and ongoing cost that comes with a decision to switch tools.

~~~
gbrayut
Case in point, one of my favorite system architecture/design blog posts:

Providence: Failure Is Always An Option

[https://jasonpunyon.com/blog/2015/02/12/providence-failure-i...](https://jasonpunyon.com/blog/2015/02/12/providence-failure-is-always-an-option/)

Which has one of my favorite quotes about scope creep:

Kevin and I have essentially become allergic to big projects. We attempt to
practice “What can get done by Friday?” driven development. Keeping things
small precludes a whole class of errors like “We need a new datastore”, ‘cause
that ain’t gettin’ done by Friday. It’s hard to sink a week on something you
weren’t supposed to be working on when all you have is a week.

------
dswalter
According to their pricing page, a 50TB SSD from Nimbus costs $12,500. That
would have been worth the expense in DBA time alone, right?

Edit: added the link.
[https://nimbusdata.com/products/exadrive/pricing/](https://nimbusdata.com/products/exadrive/pricing/)

~~~
BrentOzar
> That would have been worth the expense in DBA time alone, right?

You're assuming the DBA did nothing else during the project's timeline, and
that's not likely the case. A lot of work in projects like this is just
letting the batch job run while you keep an eye on it.

------
miahi
I worked on a system that generated around 500GB-1TB of log data monthly. An
estimate made well before it went into production said it would generate less
than 100GB/month, so the DB storage was sized accordingly (~6TB, half of which
was used for other data). A change that came later in the project (on a
different system) was disregarded, but it increased the application traffic by
a factor of 5, and the space estimate was never updated. So it was an
interesting surprise when the space started filling up way faster after
go-live.

To make the issue even more interesting, the system was mission-critical at
five nines and the log data was sensitive + had a legal requirement to be kept
online for at least one year, for queries. So deletion was not an option and
the migration had to be done online.

Fortunately the system used Oracle DB and the log table was partitioned by
date. Unfortunately, a single partition held three months of data, so it could
grow to consume all the free space on the DB. There was no easy way to add
storage to that server, but we did have some space on another one with slower
storage (the prod DB was RAID10, the other one RAID1).

In the end we had to create a new partition scheme, with monthly partitions
stored in different tablespaces, create another table with the same structure
and switch it with the online one (a couple of seconds of downtime), backup
the database and restore it on the slow storage, then delete the old table
structure and tablespaces/datafiles to free the space on production.

A procedure was created to detach the oldest partition from the production
database and reattach it to the slower database (where we also implemented
compression at a later stage). This meant that the query system needed changes
to work with both databases, but at least the system was up.

The good thing was that in Oracle you can detach partitions and tablespaces
and transport them to a different database without downtime. I don't know MS
SQL well enough, but I understand from the migration procedure that this is
not possible there. I guess they did that manual daily partitioning by table,
but they did not store each table in its own datafile so it could be moved as
a whole (maybe with downtime, but at least it would free the space faster).
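
For anyone curious, the Oracle-side moves look roughly like this. A sketch
only, with all object names invented:

    -- Swap the oldest partition's segment into an empty standalone table;
    -- this is a metadata-only operation, no data gets rewritten.
    ALTER TABLE app_log EXCHANGE PARTITION p_2018_01
        WITH TABLE app_log_stage;

    -- Freeze the tablespace, then ship it with Data Pump in
    -- transportable-tablespace mode and plug it into the archive DB:
    ALTER TABLESPACE log_ts_2018_01 READ ONLY;
    -- expdp ... TRANSPORT_TABLESPACES=log_ts_2018_01
    -- impdp ... TRANSPORT_DATAFILES='/u02/oradata/log_ts_2018_01.dbf'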

------
WorldMaker
Bulk copy is about the fastest you could do in C# or PowerShell, but you would
still be moving a ton of data back and forth over the wire between the DB
server and the "jump" server running the script.

I would have tried a hybrid approach with the dynamic SQL one. Dynamic SQL
sprocs like that don't seem to generate good query plans (at least not without
a lot of prodding) in my experience, but if the dynamically generated SQL were
built on the C# or PowerShell side rather than in a sproc, I feel the query
analyzer would be less confused about the individual queries coming out of the
scripts.

The only other advice from a past life I have to offer is that I often saw
better performance from T-SQL MERGE than from T-SQL INSERT ... SELECT.
Dynamically generating MERGE statements from C# is not fun, but in that past
experience it gave the best results for single-server data migrations like
this. (Multi-server, there's not much you can do beyond Bulk Copy.)
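
For the curious, the generated statements were along these lines. A sketch
with invented table and column names, not the actual migration code:

    -- Merge one old daily table into the new combined table.
    MERGE dbo.TrafficLogs AS t
    USING dbo.TrafficLogs_20190101 AS s
        ON t.Id = s.Id
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (Id, CreationDate, Url, UserId)
        VALUES (s.Id, s.CreationDate, s.Url, s.UserId);

The ON clause makes re-runs idempotent, which matters when a batch dies
halfway through.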

------
champtar
If they want to migrate to something else, they should have a look at
ClickHouse. When switching from Elasticsearch to ClickHouse 1.5 years ago, I
reduced my storage needs by a factor of 20, and gained SQL, performance, and a
ton of analytics features.

In hindsight I would say that Elasticsearch is for full text queries, and if
you are using it for something else (access logs) there is a good chance this
is the wrong tool for the job.
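
For reference, a minimal ClickHouse table for access logs looks like this
(schema invented for illustration):

    CREATE TABLE access_logs (
        ts      DateTime,
        url     String CODEC(ZSTD),
        status  UInt16,
        user_id UInt64
    ) ENGINE = MergeTree
    PARTITION BY toYYYYMM(ts)
    ORDER BY (ts, url);

Columnar storage plus per-column compression codecs is presumably where most
of that 20x storage win came from.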

------
user5994461
This is all the more reason why logs must have a limited retention period
(usually 1 to 3 months).

Then nobody has to waste 11 months of their life trying to retain and migrate
TB of old logs that nobody needs, with no storage to do so and no backup.

------
jeffbee
Storing web logs in a SQL database is a pretty bad plan.

~~~
tyingq
Combined with _" The database was initially designed to have a single table
for each day."_ Oy.

~~~
cm2187
Yeah, if they're doing that they might as well shard the database and not have
this problem in the first place (not that there aren't better tools).

------
myrandomcomment
So this is all interesting from a technical point of view, but from a business
point of view it makes no sense. The amount of time they had a talented DB dev
working on a project that could have been solved by spending ~$25K on a new
system seems silly. Bring the new system up and move the data over the
network, doing the conversion as part of the move. 100TB over a 10G network
(~1.25 GB/s) is ~22h of theoretical transfer time.

------
metadata
I think 11 months is absolutely crazy, and the context for that is that I earn
my living selling database migration software [1]. They could have used my
software to copy up to 16 tables in parallel for a mere $3k. They could have
copied a few hundred tables at a time, compared source and target to make sure
everything was fine, then dropped and compacted. As more disk space freed up,
they could have moved more and more data in one go. If the data needed much
transformation, they could have written views to pull the data into the proper
format and created tables on the target from that data as they went. True,
spinning drives would kill performance, but my tool easily reaches 50MB/s on
SSDs, and spinning drives wouldn't add months. I mean, when you have
off-the-shelf products doing what you need, spending so much time fighting
with custom scripts is an odd choice. This is really not meant to advertise;
I'm just baffled by how much effort was needed for this project.

[1]
[https://www.spectralcore.com/fullconvert](https://www.spectralcore.com/fullconvert)

------
hansitomani
Wow.

I really want to know if they even compress the log data in their database.

40TB of logs in an SQL database... that can't be efficient, can it?

I would probably just have a scaling blob store and store it all away there.
That has to be much faster, simpler and cheaper.

~~~
greggyb
They mentioned in the article that they are using SQL Server's clustered
columnstore index. This is a poor name, as it's more than just an index. A CCI
in SQL Server is an on-disk storage format: a highly compressed columnstore.
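
Converting a table is a single statement. A sketch, with index and table
names assumed:

    -- Rebuilds the table into compressed columnstore segments.
    CREATE CLUSTERED COLUMNSTORE INDEX CCI_TrafficLogs
        ON dbo.TrafficLogs;

    -- Optionally trade CPU for an even smaller footprint on cold data:
    ALTER INDEX CCI_TrafficLogs ON dbo.TrafficLogs
        REBUILD WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE);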

------
dkdk8283
I run a 30TB MySQL server on EC2. We RAID 0 multiple EBS volumes for space and
IOPS.

It is completely unmanageable - I inherited it. We’re moving data out, but
slowly. This DB houses PII for tens of millions of users.

~~~
techntoke
So you're paying roughly $72,000 per year? Or are you using SSDs?

~~~
dkdk8283
gp2. Master and multiple read slaves. We pay millions per year for AWS and
support.

~~~
techntoke
Can I submit a proposal?

~~~
dkdk8283
Sure

------
jto1218
> There were lots of reasons this needed to be done, one being tech debt. We
> realized that the original daily table structure wasn’t ideal. If we needed
> to query something over several days or months, it was terrible — lots of
> UNION ALLs or loops to crawl through days or even months at a time was slow.

Maybe I'm missing it, but there doesn't seem to be any discussion of the
result. _How_ much faster was it after the change? Was it 11-months-worth-of-
dev-time faster?

~~~
BrentOzar
> Maybe I'm missing it, but there doesn't seem to be any discussion of the
> result.

Sometimes you have to stop writing after 9,000 words. ;-)

------
thisisbrians
I'm in the process of migrating a ~5.4TB time series database. This is also
slow/painful. I wonder if there are any general-purpose/open source tools that
could help with these sorts of tasks (I've had to roll my own job queue/ETL
worker script infrastructure to manage this). In my case, it looks like the
full migration will take around 10 calendar days assuming no issues come up.

~~~
user5994461
The trick is to find a common baseline format, like CSV or JSON, so you can
leverage the bazillion existing tools.

The second trick is to split the data into manageable chunks (a day of logs,
for example), so you can transfer one unit at a time to the new system.

Dump a chunk on one side, load on the other side, verify it's good. It will
still take a while but you have a good indication of progress and don't need
to start all over if one chunk fails.
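
In T-SQL the whole loop can be as simple as this sketch (names invented;
assumes both databases are reachable from one server):

    DECLARE @day date = '2019-01-01';
    WHILE @day < '2019-02-01'
    BEGIN
        -- Copy one day's chunk.
        INSERT INTO NewDb.dbo.TrafficLogs WITH (TABLOCK)
        SELECT * FROM OldDb.dbo.TrafficLogs WHERE LogDate = @day;

        -- Verify the chunk before moving on.
        IF (SELECT COUNT_BIG(*) FROM NewDb.dbo.TrafficLogs WHERE LogDate = @day)
         <> (SELECT COUNT_BIG(*) FROM OldDb.dbo.TrafficLogs WHERE LogDate = @day)
            THROW 50000, N'Chunk row count mismatch', 1;

        SET @day = DATEADD(DAY, 1, @day);
    END;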

------
yknx4
I'm baffled by the fact that they were toying with the production database on
the production machine. That sounds incredibly dangerous.

~~~
BrentOzar
> I'm baffled by the fact that they were toying with the production database
> on the production machine. That sounds incredibly dangerous.

As Taryn stated repeatedly in the post, they simply didn't have enough
infrastructure to have development spare copies of this. They didn't even have
enough infrastructure to take a backup.

~~~
adfinoinio
Which is obviously nuts. If your developers are ever spending months of their
time to work around infrastructure limitations, that means you need more
hardware. You could build a new server to handle this job easily for much less
than the cost of this person's time.

~~~
Shog9
Unfortunately, the exchange rate for converting slivers of dev time into
hardware is... Unfavorable. And up until two days ago[0], Stack Overflow was
running on fumes - shedding staff, perks, equipment to keep the ship afloat. I
know of folks who spent months scrambling to get new servers before the old
ones died, and... Didn't quite make it.

Given the situation, investing time into making do probably seemed like a much
better bet than hoping for a windfall with which to buy equipment.

[0]:
[https://news.ycombinator.com/item?id=23976881](https://news.ycombinator.com/item?id=23976881)

------
idlewords
I'm sitting in a hotel waiting for 22 TB of data to copy over, so I could not
have read this piece with more interest. Moving data in these quantities is
like pouring tar, and it's always great to hear war stories from people who
have done it. It is always more painful than message board theory would
suggest.

------
nwmcsween
Someone really needs to make an anomaly filter for logs so data storage
becomes minuscule for the purpose of anomaly detection. One hacky way would be
to preload a compression dict with known-bad strings from the binary; another
would be to use something like HyperMinHash to spit out low-similarity
strings.

------
jiggawatts
I can't imagine why she'd use LINQPad when it's possible to get two SQL
Servers to talk directly to each other. Just configure a linked server and
write an INSERT statement pulling the data from the other server.
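
Something like this — a sketch, with server and table names invented:

    -- One-time setup on the destination server.
    EXEC sp_addlinkedserver
        @server = N'OLDSQL',
        @srvproduct = N'',
        @provider = N'SQLNCLI',
        @datasrc = N'oldsql.internal.example';

    -- Then each daily table is a single pull over the wire.
    INSERT INTO dbo.TrafficLogs WITH (TABLOCK)
    SELECT Id, CreationDate, Url, UserId
    FROM [OLDSQL].TrafficDB.dbo.TrafficLogs_20190101;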

------
bluedino
Hah. My first thought was, 'what kind of idiot lets a file get to be 44TB...'
Then I realized who it was. Nice to see that even Stack Overflow had an issue
like this.

------
tuananh
I'm surprised the author didn't think of availability groups earlier. Maybe I
missed something from the post?

------
muzaffarpur
I wonder how effective it would be to move the data using Spark? Has anyone
considered that?

------
bjg2
any reason for creating a table per day / month instead of a single table
partitioned by day / month?
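
i.e., something like this sketch (boundary values and filegroup placement
assumed):

    CREATE PARTITION FUNCTION pf_daily (date)
        AS RANGE RIGHT FOR VALUES ('2019-01-01', '2019-01-02', '2019-01-03');

    CREATE PARTITION SCHEME ps_daily
        AS PARTITION pf_daily ALL TO ([PRIMARY]);

    CREATE TABLE dbo.TrafficLogs (
        Id           bigint        NOT NULL,
        CreationDate date          NOT NULL,
        Url          nvarchar(400) NOT NULL
    ) ON ps_daily (CreationDate);

Old partitions could then be aged out with ALTER TABLE ... SWITCH instead of
dropping whole tables.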

