Hacker News new | past | comments | ask | show | jobs | submit login
Migrating a 40TB SQL Server Database (tarynpivots.com)
126 points by ksec on July 30, 2020 | hide | past | favorite | 120 comments

I think some of the posts here miss a little bit of the context as to why things like this happen in the first place. It's only in the last handful of years that a stack for logging has really become mainstream. Chances are a lot of these types of logging solutions predate that and used whatever persistence technology was readily available. Writing to files on web servers can be a pain, and these logs will have to be queried at some point, so storing it in a database is not a bad idea, especially when better options have only recently become available.

The problem is that relational SQL is bad for logs, but by the time it gets to the scale where it's problematic that there's too much volume in the logs to make anything "easy". Simultaneously there's a lot of business value in that log data that you don't want to lose.

Yes, SQL's a poor fit for logs, but it's a better fit then a lot of other things, including not logging at all. Better solutions exist, but they don't exist in a bubble, and there's a cost to integrating them and migrating to them. A lot of these comments seem to be judging a technology decision based solely on hindsight without realizing that there are legitimate reasons for logging to SQL.

Totally agree. I would be curious if anyone who is saying why didn't you use XYZ has actually migrated 500 billion rows of queryable log data from one tech to the other. For everyone complaining about PII. The article shows a 6 year log retention which fits with many regulatory requirements for data retention. We forget that a lot changes in 6 years on what makes sense to do. ELK stack according to elastics history page became a real thing in 2015.

I've done with a hundred billion rows. It's not too difficult but it takes a lot of time. Note that elasticsearch scales really easy and really well, I have an article on a small setup many years ago for 12 TB in a small company https://thehftguy.com/2016/09/12/250-gbday-of-logs-with-gray...

The challenge is, a migration would require a lot of hardware resources (storage mainly) to setup a new cluster and the poster couldn't get one damn hard drive to work with, so any migration would have been a death march from the start.

Knowing that and that the original system was out of disk and had no backup of any sort. I would personally question the effort to try to keep all these old logs.

I disagree with the premise of your statement. It's typical that a log will be accessed zero times. Collecting, aggregating, and indexing logs is usually a mistake made by people who aren't clear on the use case for the logs.

Absolutely, the vast majority (95%+) of logs are never read by a human. Therefore, processing it is enormously wasteful. A good architecture will write once and not touch anything until it is needed.

I spent years working on system handling 50+PB/day of logs. No database or ELK can handle that, and even if it did it would be prohibitively expensive.

Where did you work? CERN?

It's adorable when people think scientific computing has the same scale as a Google or Microsoft.

Ignoring the fantasy b.s. in the second half of the article, the stuff at the top is exactly what I mean.

A mighty 400 GB/s: i.e. much less than the > 50 PB/day of logs the other person mentioned;

1600 hours of SD video per second: i.e. about 1-2 million concurrent HD streams, or much less than the amount actually served by YouTube.

IBM Summit "world's most powerful supercomputer": < 5000 nodes, i.e. much below the median cell size described in the 2015 Borg paper. Summit is a respectable computer but it would get lost in a corner of a FAANG datacenter.

CERN is a correct example. The LHC reportedly generates 1PB per second: https://home.cern/news/news/computing/cern-data-centre-passe...

If you define “generates” to mean “discards” then yes.

It still gets processed though and only all non-interesting events get discarded..

Otherwise the tape alone to store it on would exceed their total operating budget in a day, so they have to be a bit clever about it.

I think that even Google can not save 1PB/s in 2020.

The numbers are not fantasy at all - this will be a huge radio telescope - one square kilometer of pure collecting area and thousands of receiving antennas (For reference: Arecibo has around 0.073 km^2). We are talking data input to the correlator on the terabit/s scale. And technology-demonstration with ASKAP are well under way. ALMA is working quite well by now as well (> 600 Gb/s with just an 50 antenna array).

it’s adorable how proud you are to have worked at FAANG and how angry you get at the idea some other organisation handles equivalent scale


400GB/s is about 35 PB/day

not quite as big a difference

So, Youtube puts whole streams into their logs? Interesting.

Indeed, I mostly stop looking at logs once I get the metrics from mtail in prometheus/grafana.

What is the use case for logs?

There isn't a universal one. If you don't have a concrete one in mind, you shouldn't produce the log at all.

I appreciate the zen-like nature of this advice, but I think you also know how unreasonable it is most of the time, unless by 'concrete' you allow something as vague as, "troubleshoot production issues".

Ad-hoc production troubleshooting is a reason to keep, at most, 7 days of logs. Usually you want the most recent minute or hour. Troubleshooting usually does not need collection, aggregation, and indexing because either the problem is isolated to a host or the logs of a single host, pod, or process are representative of what is happening in the rest of the fleet. Even if you want to access all logs, it's still better to leave them where they were produced and push a predicate out to every host; your log-producing fleet has far, far more compute resources than your poor little central database, no matter how big that DB is.

What a bunch of odd and arbitrary statements. Examples: I often use logs older than 7 days for troubleshooting. I rarely troubleshoot only using last minute or hour data. I need aggregation most of the time when troubleshooting. I also treat most runtime environments as cattle so relying on it to keep logs locally would be wrong.

All but trivially reproducible bug reports require, or benefit immensely from, logs about the transaction in question. The pipeline from support to product to engineering to an actual investigation is usually much more than 7 days.

May I ask what kind of production environments you have in mind? Are these large-scale FAANG-style deployments or something else?

Well, only to the extent that the management of small amounts of logs is not very interesting. There is not an ACM SIG for very small databases.

Anyway, GPDR requires you to have a purpose for any log that contains any IP address. Keeping logs for undefined purposes and unlimited time frames is not ok any more.

It could be an e-commerce site, which depending on the case can produce a shitload of logs. Imagine having a few hundred of thousands of users daily and you record every page they view, along with heatmaps, and whatnot. In most cases those logs should never be touched by a human in raw format. You feed them in your analytics engine, and start making decisions about your conversion. And then you delete them because the goalpost is constantly moving.

To be fair, a lot of problems came and come from 'let us store this stuff just in cases'. It also helps in some unforeseen cases, but in both scenarios you essentially end up in an unstructured unknown situation which is generally not what you want in IT, Business or as a person.

Logging to a database is probably a mistake if you haven't thought about how you'd use that data after you write it to disk. If you have good ideas about how you'd want to use that data, then it is probably a fantastic idea.

We log to SQL so that we can instantly obtain a full list of log entries that pertain to a specific user action trace id. We can go from a collection of user actions over a larger business process and then for each action we can pull all of the log entries that were generated. All of this is exposed through a nice web interface with full text search capability over the log messages. Without logging to SQL, we would have a hell of a time building something similar.

In order to keep this from exploding out of control, we have a strict 90 day expiration policy on all persisted business state and log entries. Our log table indicies are:

- User Action Trace Id (16 bytes)

- Timestamp (64 bits)

- Message (Variable - FTS)

We store all log entries in a single SQLite database (logs.db) contained in each environment. These are queried over HTTP from centralized management tools.

We use similar setups but with ElasticSearch instead of an RDBMS and we use cluster connections to query against disparate environments at the same time. They are not allowed to communicate to each other, but a 'placeholder' or 'empty' cluster that is allowed to talk to the others can give you exactly what you want for analysis and auditing. Another big benefit is that environments and other 'boxing' formats are isolated and cannot influence another. Same with lifecycle management, they do their own hot/cold/archive/deletion.

We usually end up querying based on session signatures (which is something like user agent + JWT), on plain user ids or on trace ids. This gets you a pretty neat timeline with references for full records if you need to drill down on a specific event. Makes everything super fast and when you need more data or want to read PII you can drill down (depending on your access roles).

> I think some of the posts here miss a little bit of the context as to why things like this happen in the first place.

There were always a bunch of bad decisions and a few good one in hindsight. I think it's completely fair to discuss solutions (including more recent tech like ELK) to the problem even more so if the problem still exists.

> Yes, SQL's a poor fit for logs, but it's a better fit then a lot of other things, including not logging at all.

No I disagree, not logging 40 TB of data to SQL would have been at least an equally good if not better choice, based on the input we have from the blog post. Just keep the past few weeks of raw traffic logs and be fine.

This reads like the most Microsoft or most legacy-on-prem problem ever. Why on earth would you do this, and even then not take the opportunity to do it right while you are migrating anyway? Are you mad?

Well, that was my first reaction anyway. It seems to me that this is more of a 'when all you have is a hammer, everything looks like a nail' to me; if you are a vendor shop (i.e. Microsoft-only) and you are a DBA and specifically just for MSSQL, then this is probably the best way you could come up with, based on the skills, experience and tools at hand. It does make me wonder if at any point someone thought to ask for help instead of keeping this going.

> Why on earth would you do this, and even then not take the opportunity to do it right while you are migrating anyway?

Because as a database administrator, you often have to solve hosting problems without telling your developers to rewrite all their code.

Which speaks to much more dire organizational issues - that nobody can throw up a red flag and say "hold the eff up, we've reached a scale where this is broken and we need to address it."

I would say this is actually the norm... In most orgs the leadership is not highly technical and this is always a hard sell.

The best you can hope for after a "hold the eff up" rewrite is for everything to keep working the same "but it's more hardened/scalable/modular/blah blah", and the worst you can hope for is to screw up some critical business process while the kinks are worked out.

Also, there is rarely any incentive for anyone from Joe Developer all the way up to C level to even call for this in the first place.

Unless the developers are HAProxy itself and made a module for HAProxy to log in to SQL databases (which it doesn't do by default) this is not even remotely related to the developers.

Plot twist: it does support SQL out of the box!

HAProxy logs syslog messages, in a configurable format and to a destination of your choosing. All you have to do is to setup your favorite logging agent between fluentd/logstash/syslog-ng to listen on syslog and forward messages to a SQL database.

I know ;-) Hence my lawyer-ish choice of wording (default, module). HAProxy has some real nice features, especially the 1.8+ versions. We used to have configuration management tooling reconfigure and reload it, later on it get hot-reloads which was nice and then it got consul support so we could just feed consul and it all would work like magic. Love it.

My point was more about logging something straight in to a RDBMS wasn't much of a default at any point in time. Earlier on it was too expensive, and later on better systems came along. Some systems have built in text analysis and time series data specifics, but why would you worry about those when you have ElasticSearch and Kibana for most generic setups. Even before that you have syslog and a lot of log parsing libraries that worked great (well, unless you used windows event logs which has pros and a lot of cons).

How would you 'do it right'?

I've only done a few 30+TB moves, largest being 50TB which had similar table structure - one table per machine holding calibration readings that were constantly streamed in from a MUCH beefier RabbitMQ instance. Old CPU, not a ton of cores, spinning disks, etc. Same base problem. I think the best thing to do here would be change the order of some things around.

1. Buy the hardware first with blazing drives and roaring CPUs and get it stood up before you start trying to change your schema on a potato

2. Do his last step now: set up an AG and let SQL Server do it's thing

3. Now go start planning your ETL/schema changes and writing scripts while you wait 1-3 days (or in my case 10 as we had to suspend during business hours) for data to migrate.

4. Fail-over and promote and start collecting data on your superbox

5. If the world didn't explode, you do the same thing at your other DC

6. Now, if you absolutely must, apply and test all your schema changes in production instead of dev first but enjoy the fact that those changes are exponentially faster!

You could also use log shipping, but with those restore times you'd have data loss I think. Haven't done that in years - I've been lucky enough to only have to deal with projects where we shutdown, restore, and do a full downtime migration. So much nicer...

Having step 1 of your solution be "eliminate the hardware constraints that made this job painful" is kind of a cheat. We can all do a better job on any problem if we lift those constraints.

Hardware is way cheaper than 11 months salary + lost opportunity cost.

I think we all agree on that, but the question was not whether this was an efficiently managed migration. I think we can assume OP made a similar argument up the chain (he got them to add SSD storage, at least).

We did it by removing all non first-party access and turning the only access into a single-owner API service. Then we started partitioning off certain calls in to a cache, i.e. when someone decides they want all possible header combinations over a time period they don't get instant results but instead get a job ID and they can either get an event trigger when the results are in or they can poll for status if that is what they like.

Next, we get the cache to write to ElasticSearch and on lookup it would first check if memcache has a pointer to ElasticSearch already and if not it would spawn the job, copy the input set to ES, run the query there and add a pointer in memcache.

This meant that in practise you'd retain the same functionality you had before while not having a hard dependency on the backend.

Next, we scaled op the API service and the cache service to a very high number and started querying for time ranges that were known to not have pointers in memcache. Depending on the amount of spot instances we could get for the price we wanted and the saturation of the 10GbE DirectConnect we could do up to 400MiB/s, 29TiB took us about 3 days because the DirectConnect was also used to other traffic that had a higher priority and the back pressure system would reduce the API request count accordingly (using Istio).

In the end the users had only two changes:

1. No more 'talking to the database', you have to use the API which started out rather restriced

2. Some analysis takes a little longer during migration due to the SQL-ES copy action; after the MSSQL stuff was all gone the queries were much faster than before

I found this article useful, mostly for people who don't have similar experience dealing with large databases. It gives so many reasons on why following this path is such a bad idea. Please people, stop writing custom scripts and then run them untested on the production db. If the company doesn't pay to give you enough disk space to take a proper backup to test just quit or buy it yourself. It will save you 11 months and many hours of sleep.

I've done migrations of several databases with ~20TB of data before. Assuming you're data is immutable, A much easier approach than what they did is to restore a backup of the database onto a new machine. Since you don't have to worry about serving production traffic you can have the machine go full throttle on performing the migration. When the migration has finished you have to copy the new data that came in during the migration into the new database which isn't too bad.

When I've done this with 20TB DBs, it usually took several days to complete. That's much better than the 11 months mentioned in the blog post.

I'm late to the party, but he mentioned this in the article. The new server didn't have enough space to hold both the backup and the newly restored database, so that approach didn't work.

Quote: "We’re a very lean shop when it comes to hardware, and this is by far our largest chunk of data we have to store, so there wasn’t a place anywhere on the network to store a 33-38TB full backup temporarily, while it was being restored to a server."

As others have mentioned in the thread, a lot of the problems seem to have been solvable by just buying/attaching more storage, even if only temporarily.

Similar thought here. Was kind of surprised they didn't go with some kind of checkpoint/snapshot approach for the data sets, + stream them to the new machine, + add any newly arrived log data.

Yeah I only scanned through but there seem to be better tools for the job in the environment. SSIS is scriptable/has APIs so you can shred the data into n packages and run it. Also, columnstore_archive gives amazing compression if you're low on the read side and immutable data means you can load into partitions. Column diffs are a bit annoying but not unmanageable. Then again, arm chair architecting isn't ever as accurate as you assume. I do know you can move billions of rows daily using SSIS across thousands of tables.

It seems concerning that they would be hanging onto traffic logs for that long. Seems like a massive PII / data breach liability.

Agreed, it's questionable why a company needs to save raw traffic logs for years.

You say that but people were pretty upset that Google couldn't confirm if anybody exploited the G+ security bug past the last two weeks that it was discovered.

+1 for them. Pretty cynical, I know. But imagine the request for legal discovery...

Yea, it seems like this project should have started with a cost/benefit analysis. Like what is this data used for, who is using it, how often and how to quantify the benefits the users derive from it.

Not sure which the data couldn’t just be compressed into daily summaries in the DB, while the raw logs are stored (Compressed) in the file system in case you ever had to go back.

Came here for this comment. 4 years of traffic logs... is just plain brutal.

Stack Overflow HAproxy logs likely don’t have too much sensitive information. These are most likely logs about visited pages with timestamps - maybe with client IP addresses.

Storing them raw can make sense. You might later come up with new ways to analyze the data which your aggregation logic of choice does not support.

Right here at the beginning:

"Our HAProxy1 logs aka traffic logs, are currently stored on two SQL Servers, one in New York, and one in Colorado. While we store a minimal summary of the traffic data, we have a lot of it. At the beginning of 2019, we had about 4.5 years of data, totaling about 38TB. "

and I asked myself, why the hell would you want this data?

Logs serve two purposes:

- find a fault and fix it - audit what someone (or many someones) did

You never need years of data to find evidence of a fault: either it's available in the last few weeks or you didn't log the data in the first place (unless, of course, it's a bug that only shows up on leapdays, or leapseconds, or is similarly rare...)

So that leaves tracking your users in a way which isn't covered by your user's own history. Again, a month or two might be helpful, sometimes... but why years?

> unless, of course it's a bug that ... is similarly rare...

I'll give an example: a bug in the revision ordering logic that could be triggered by two edits being submitted almost simultaneously where one of the editors had made another edit within the past few minutes. The bug resulted in rows being interleaved between the two revisions and the post being rendered as a combination of two unrelated edits.

Almost impossible to reproduce. Affected a low double-digit number of posts out of millions edited over the course of several years.

Was able to finally nail down the scenario by pulling logs for all editors involved in each of the events over the years and identifying the common factor.

(You can argue this is a crazy way of storing revisions, but that's beside the point; at this scale even very unlikely bugs do show up with some frequency and it is nice to be able to investigate them)

"at this scale even very unlikely bugs do show up with some frequency" exactly! In the past I've worked on systems where "this will never happen" occurs because "this will never happen" really means the probability of this occurring is extremely low. But when you're dealing with millions of possibilities that low probability event starts occurring every day. And when it occurs for one of your largest customers it's no longer "this will never happen" but "why is this happening?"

> "We realized that the original daily table structure wasn’t ideal. If we needed to query something over several days or months, it was terrible — lots of UNION ALLs or loops to crawl through days or even months at a time was slow."

Seems to me rather storing traffic logs in an SQL database is a very bad idea, especially if you have 40TB of it. Anybody can explain why SQL database makes sense for large time series data? Why not just dump it into linear files?

> Anybody can explain why SQL database makes sense for large time series data?

Let's say, just to say, that you have effectively free licensing for Microsoft SQL Server (because you already own the licenses in question), and you have staff who know that platform well.

Sometimes it's easier to use the tools you already have rather than go acquire a new platform to achieve a goal.

As with anytime you choose a persistence layer, you're accumulating technical debt the instant you make the choice. Over time, if better persistence layers become available, you have to make a judgment call about when it makes sense to change the code, versus when it makes sense to do maintenance work on the persistence layer.

>and you have staff who know that platform well.

People often ignore how important this is, especially for a small team like Stack Overflow appears to have. A new platform requires training and/or a lack of productivity as employees learn the new system. It requires developer time to convert over any work designed to use the old system. It makes hiring more difficult because now you will either want someone with the knowledge of both your systems which greatly shrinks down your pool of potential hires or new employees will need to learn a new system on the job which increases the ramp time up for any hires. Maybe this doesn't fit the textbook definition of technical debt, but it is still a real and ongoing cost that comes with making a decision to switch tools.

Case in point, one of my favorite system architecture/design blog posts:

Providence: Failure Is Always An Option


Which has one of my favorite quotes about scope creep:

Kevin and I have essentially become allergic to big projects. We attempt to practice “What can get done by Friday?” driven development. Keeping things small precludes a whole class of errors like “We need a new datastore”, ‘cause that ain’t gettin’ done by Friday. It’s hard to sink a week on something you weren’t supposed to be working on when all you have is a week.

I debate this a lot with coworkers. Our industry really loves novelty and new things, but very little expertise is available for new things. I’d rather have an expert at VB6 than a completely new Julia developer.

Because you've already paid the licensing cost for the SQL database and have salaried DBAs taking care of it. And building an alternative solution (including HA, backups etc.) would cost more than $0.

> I fully expected the process to take months, and it did — after hitting delays in getting new servers the entire project took about 11 months.

This doesn't sound like it cost $0

At the time of the decision was made(4.5~ years ago) it probably was $0. Now it's worth the cost to migrate because the value of keeping that data has been proven worthwhile by the cheapest initial solution.

But the real cost is not 11 month. Db Dev were not 100% working in this project, just monitor from time to time that everything is ok.

> Why not just dump it into linear files?

I guess because they want to query them, but in this case I would prefer the ELK stack (Elasticsearch + Logstash + Kibana).

But even then, normal files work great with (all non-Microsoft) software that is actually intended to work with this type of data at this scale.

You could do it on one machine with Pandas DataFrames or on multiple machines with Spark DataFrames. You could keep warm logs in standard log files and put everything else in Parquet. You could use a parallel database. Heck, you could use any of the cloud offerings to do it.

And you could use ElasticSearch. ElasticSearch would be fine for this.

Curious what use-cases you're envisioning here.

Having worked with the system described in the past, it's quite flexible: data can be easily pulled out into more specialized systems for aggregation or trend analysis, while the raw logs remain quickly accessible over long enough periods of time to allow for digging into everything from support cases involving a single person to monitoring distributed attacks. Access can be controlled and monitored with reasonable granularity, and training new folks to use it is as easy as teaching some basic Select queries.

Not gonna suggest it's the most efficient system (if nothing else, SQLServer wastes a TON of disk space), but it minimizes complexity while deftly avoiding the choice between throwing away information when it's still needed and keeping everything in flat files which are too unwieldy to be used.

It's not that the system itself isn't functional at all, but think about cost, flexibility (i.e. the migration as posted) and support.

If you have to move that much data at once because your RDBMS doesn't work in any other way you're going to be in trouble, even if you get the 50k costing SSDs.

All of the things SQL server does can be done with not-SQL-server things, but you suddenly gain the capability to take shards offline and do multi-version migration, multi-host migration, multi-storage-backend migration. You can query in SQL, but also GraphQL. You can use Lucene search, and you can use ES-specific queries. You can have multiple write hosts, you can have tiered storage on application-level, OS-level and SAN-level and they can actually work together.

Again, it's not that SQL server doesn't do anything, it's just that it's probably not the best plan for time series access logs.

I'm sure there are better options. OTOH... I was at SO for about 9 years, and the system described existed for all of it - predating GraphQL, predating even ElasticSearch. There's stuff built on top of it that was never envisioned when it was created, and stuff that wouldn't have been possible without it.

At this point, moving to something else would likely be a far more costly investment just in terms of requirements analysis than the migration described in the blog post. Would it pay off? Maybe! But for a critical bit of infrastructure, that's something you put an awful lot of careful thought and research into before you venture to do anything... And meanwhile, that infrastructure has to be kept running.

One important facet that isn't much technology related at all is the knowledge pool you can work with. Every time a generic component is implemented non-standard you essentially end up with unusable knowledge for both existing people leaving for another job and new people coming in with the common knowledge.

While most companies that are old enough to predate the iPhone usually have a lot of in-house technology that predates conventions and generic implementations, there is plenty of movement towards commonalty instead of keeping that special snowflake stuff in; at least at the places where I have/had influence.

When you freeze an in-house tool in time you essentially rob your own people of progress. The solution isn't rip-and-replace, but a process of continuous improvement is good for everyone.

> But even then, normal files work great with (all non-Microsoft) software

Even Microsoft has a tool [0] that allows you to query IIS logfiles with SQL

[0]: https://www.microsoft.com/technet/scriptcenter/tools/logpars...

>storing traffic logs in an SQL database is a very bad idea, especially if you have 40TB of it

I get this question asked a lot of times in the interview.

Why is it a bad idea to store logs in a SQL database? It has the ACID properties baked in and it is pretty reliable. What is the better alternative? I guess I have never worked on anything that would require keeping track of 40TB of data because of which the drawbacks aren't really obvious to me.

Because everything gets substantially slower and more difficult, see 11 months of busy work motivated by queries being slow, described in TFA. SQL is great for querying data, but you don't need to store the whole history in a single database.

ACID is a red herring. These are traffic logs, not some precious global state that has to handle ACID transactions.

The better alternative is to dump the data into many big files sorted by time. When you need to do some SQL kungfu on a part of the data, store that part into SQL database.

Then use partitioning (which the OP appears to be doing)?

If you need to do some kind of full table scan, what exactly is going to be faster about noSQL? With SQL, your data is presumably going to be non-contiguous so you are going to get slower seeks on disk...but (I believe, although I am not 100% familiar) this can happen with noSQL if you add an index (which are often BTree...so the same as SQL under the hood) and you have things like SSTables/compaction potentially slowing performance too (although maybe not an issue in this application...I don't know).

If you are doing some kind of full table scan though, there is no weird magic in noSQL that allows you to perform this faster afaik. You are reading from disk, that is your limit. The point, afaik, with noSQL for logging is that writes are faster so the throughput is larger...which can be required for this use case but probably isn't relevant for everyone.

> If you need to do some kind of full table scan, what exactly is going to be faster about noSQL?

The real architectural benefit we got from noSQL (remember - "Not Only SQL") was dynamic horizontal scalability. Once you separate Compute and Storage then you can make choices based on your specific workload.

> You are reading from disk, that is your limit.

Separating Compute and Storage works around this. Instead of putting all your logs in a monolithic SQL database, stream them into object storage. When you want to run a query, you can complete it arbitrarily fast by spinning up more compute nodes. Now you're using Massively Parallel Processing and that disk limit stops mattering as much. Lots of systems do this... AWS EMR, Snowflake, Presto, etc etc

So you get a much simpler storage system and faster queries.

You may appear to pay more for storage in this approach, but that's mostly a cloud vs on prem accounting question.

ACID properties are not useful here. Unless every single piece of data is required, which I highly doubt, then ensuring that every single log is written correctly doesn't matter. These are traffic logs, so if the server or application is down, how do these logs get written? I'm assuming that they don't. Have these databases never had outages as well, because if they did then the traffic would be lost as well. I would wager a guess that there are probably lots of gaps in their data that they just don't know about or don't care about.

So given that not every single log is required, then you can step back from ACID requirements and find a solution that is easier to maintain. Taking 11 months to migrate a database is not reasonable for most projects.

According to their pricing page, a 50TB SSD from Nimbus costs $12,500. That would have been worth the expense in DBA time alone, right?

Edit: added the link. https://nimbusdata.com/products/exadrive/pricing/

> That would have been worth the expense in DBA time alone, right?

You're assuming the DBA did nothing else during the project's timeline, and that's not likely the case. A lot of work in projects like this is just letting the batch job run while you keep an eye on it.

Yes. Or even stripe some cheaper $/TB drives together in a RAID to cover the needs.

If they want to migrate to something else, they need to have a look at ClickHouse. When switching from Elasticsearch to ClickHouse 1.5 years ago, I reduced my storage needs by 20, gained SQL, performance, and a ton of analytics features.

In hindsight I would say that Elasticsearch is for full text queries, and if you are using it for something else (access logs) there is a good chance this is the wrong tool for the job.

I worked on a system that generated around 500GB-1TB of log data monthly. An estimation done way before it was put in production said that it would generate less than 100GB/month, so the DB storage was sized accordingly (~6TB, half of which were used for other data). A change that came later in the project (on a different system) was disregarded, but it increased the application traffic by a factor of 5, and the space estimation was not updated. So it was an interesting surprise when the space started filling up way faster after the go live.

To make the issue even more interesting, the system was mission-critical at five nines and the log data was sensitive + had a legal requirement to be kept online for at least one year, for queries. So deletion was not an option and the migration had to be done online.

Fortunately the system used Oracle DB and the log table was partitioned by date. Unfortunately, a partition kept three months of data, so it could reach the full amount of free space on the DB. There was no easy way to add storage to that server, but we did have some space on another one, with lower speed (the prod DB was RAID10, the other one was RAID1)

In the end we had to create a new partition scheme, with monthly partitions stored in different tablespaces, create another table with the same structure and switch it with the online one (a couple of seconds of downtime), backup the database and restore it on the slow storage, then delete the old table structure and tablespaces/datafiles to free the space on production.

A procedure was created to detach the oldest partition from the production database and reattach it to the slower database (where we also implemented compression at a later stage). This meant that the query system needed changes to work with both databases, but at least the system was up.

The good thing was that in Oracle you can detach partitions and tablespaces and transport them to a different database without downtime. I don't know MS SQL enough, but I understand from the migration procedure that this is not possible there. I guess they did that manual daily partitioning by table, but they did not store each table in its own datafile, so it can be moved as a whole (maybe with downtime, but at least it would free the space faster).

Bulk copy is about the fastest you could do in C# or PowerShell, but you would still be moving a ton of data back and forth over the wire between the DB server and the "jump" server running the script.

I would have tried a hybrid approach with the dynamic SQL one. Dynamic SQL sprocs like that don't seem to generate good query plans (at least not with a lot of prodding) in my experience, but if the dynamically generated SQL was done in the C# or PowerShell side rather than a sproc I feel the query analyzer would be less confused about individual queries coming out of the scripts.

The only other advice experience from a past life I have to offer is that I often saw better performance from T-SQL MERGE than T-SQL INSERT FROM. Dynamically generating MERGE statements from C# is not fun, but in that past experience it would give the best results for single-server data migrations like this. (Multi-server there's not much you can do beyond Bulk Copy.)

This is all the reasons why logs must have a limited retention period (usually 1 to 3 months).

Then nobody has to waste 11 months of their life trying to retain and migrate TB of old logs that nobody needs, with no storage to do so and no backup.

Storing web logs in a SQL database is a pretty bad plan.

Combined with "The database was initially designed to have a single table for each day." Oy.

Yeah, if they do that they might as well shard databases and not have this problem in the first place (not that there aren’t better tools)

At this scale and amount of data, definitely. I have however dumped short term data into an unlogged temp Postgres table to query data using Metabase/psql. Worked surprisingly well enough for me to debug something. I'm sure another index on ES would've worked well too, but I'm more familiar with SQL than ES syntax and it's cheaper too.

Yeah, this poke my eye as well.

So this is all interesting from a technical point of view but from a business point of view it makes no sense. The amount of time that you had a talented DB dev working on a project that could have been solved by spending ~25K on a new system seems silly. Bring the new system up and move the data over the network doing the conversion as part of the move. 100TB over a 10G network is ~22H theoretical transfer rate.

I think 11 months is absolutely crazy, and the context for that is that I earn for living by selling my database migration software [1]. They could have used my software to copy up to 16 tables in parallel for a mere $3k. They could have copied a few hundred tables at a time, compare source and target to make sure everything is fine, then drop and compact. As more disk space frees up, they could have moved more and more data in one go. If the data needed to be transformed much, they could have wrote views to pull the data in a proper format and creating tables on the target with that data as they go. True, spinning drives would kill performance, but my tool is easily reaching 50MB/s on SSDs and spinning drives wouldn't add months. I mean, when you have off the shelf products doing what you need, spending so much time fighting with custom scripts is and odd choice. This is really not meant to advertise, I'm just baffled by how much effort was needed for this project.

[1] https://www.spectralcore.com/fullconvert


I really wana know if they even compress the log data in there database.

40 TB of logs in a postgresql, like that can't be efficient?

I would probably just have a scaling blob store and would store it away. That has to be much faster, simpler and cheaper.

They mentioned in the article that they are using SQL Servers clustered columnstore index. This is a poor name, as it's more than just an index. CCI in SQL Server is an on-disk storage format that is a highly compressed columnstore.

I run a 30TB MySql server on EC2. We raid 0 multiple EBS volumes for space and iops.

It is completely unmanageable - I inherited it. We’re moving data out, but slowly. This DB houses PII for tens of millions of users.

So you're paying roughly 72,000 per year? Or are you using SSDs?

gp2. master and multiple read slaves. we we pay millions per year for AWS and support.

Can I submit a proposal?


> There were lots of reasons this needed to be done, one being tech debt. We realized that the original daily table structure wasn’t ideal. If we needed to query something over several days or months, it was terrible — lots of UNION ALLs or loops to crawl through days or even months at a time was slow.

Maybe i'm missing it, but there doesn't seem to be any discussion of the result. _How_ much faster was it after the change? Was it 11-months-worth-of-dev-time faster?

> Maybe i'm missing it, but there doesn't seem to be any discussion of the result.

Sometimes you have to stop writing after 9,000 words. ;-)

I'm in the process of migrating a ~5.4TB time series database. This is also slow/painful. I wonder if there are any general-purpose/open source tools that could help with these sorts of tasks (I've had to roll my own job queue/ETL worker script infrastructure to manage this). In my case, it looks like the full migration will take around 10 calendar days assuming no issues come up.

The trick is to find a common baseline format, like CSV or JSON, then you can leverage the bazillion of existing tools.

The second trick is to split the data in manageable chunks (a day of logs for example), then you can transfer one unit at a time to the new system.

Dump a chunk on one side, load on the other side, verify it's good. It will still take a while but you have a good indication of progress and don't need to start all over if one chunk fails.

I'm baffled with the fact that they were toying with the production database in the production machine. That sounds incredibly dangerous

Actually if you read the article this is not actually critical infrastructure.

All the data is duplicated in both DCs and the source of data still has a copy of the data.

I imagine is not a problem if it goes down for some moments. Developers may be unable to review traffic logs, that's it.

> I'm baffled with the fact that they were toying with the production database in the production machine. That sounds incredibly dangerous

As Taryn stated repeatedly in the post, they simply didn't have enough infrastructure to have development spare copies of this. They didn't even have enough infrastructure to take a backup.

Which is obviously nuts. If your developers are ever spending months of their time to work around infrastructure limitations, that means you need more hardware. You could build a new server to handle this job easily for much less than the cost of this person's time.

Unfortunately, the exchange rate for converting slivers of dev time into hardware is... Unfavorable. And up until two days ago[0], Stack Overflow was running on fumes - shedding staff, perks, equipment to keep the ship afloat. I know of folks who spent months scrambling to get new servers before the old ones died, and... Didn't quite make it.

Given the situation, investing time into making do probably seemed like a much better bet than hoping for a windfall with which to buy equipment.

[0]: https://news.ycombinator.com/item?id=23976881

I'm sitting in a hotel waiting for 22 TB of data to copy over, so I could not have read this piece with more interest. Moving data in these quantities is like pouring tar, and it's always great to hear war stories from people who have done it. It is always more painful than message board theory would suggest.

Someone really needs to make an anomaly filter for logs so data storage becomes miniscule for the purpose of anomaly detection, one Jacky way would be to preload a compression dict with bad strings from the binary, another would be to use something like hyperminhash to spit out low similarity strings.

I can't imagine why he'd use LINQPad when it's possible to get two SQL Servers to talk directly to each other. Just configure a linked server and write an INSERT statement pulling the data from the other server.

Hah. My first thought was, 'what kind of idiot lets a file get to be 44TB... Then I realized who it was. Nice to see that even StackOverflow had an issue like this.

I'm surprised why the author didn't think of availability group earlier? maybe i miss sth from the post?

I wonder how effective it would move data using spark? Have anyone considered that!

any reason why creating a table per day / month i stead of using one table partitioned by day / month?

any reason for creating a table per day / month instead of a single table partitioned by day / month?

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact