The problem is that relational SQL is bad for logs, but by the time it gets to the scale where it's problematic that there's too much volume in the logs to make anything "easy". Simultaneously there's a lot of business value in that log data that you don't want to lose.
Yes, SQL's a poor fit for logs, but it's a better fit then a lot of other things, including not logging at all. Better solutions exist, but they don't exist in a bubble, and there's a cost to integrating them and migrating to them. A lot of these comments seem to be judging a technology decision based solely on hindsight without realizing that there are legitimate reasons for logging to SQL.
The challenge is, a migration would require a lot of hardware resources (storage mainly) to setup a new cluster and the poster couldn't get one damn hard drive to work with, so any migration would have been a death march from the start.
Knowing that and that the original system was out of disk and had no backup of any sort. I would personally question the effort to try to keep all these old logs.
I spent years working on system handling 50+PB/day of logs. No database or ELK can handle that, and even if it did it would be prohibitively expensive.
A mighty 400 GB/s: i.e. much less than the > 50 PB/day of logs the other person mentioned;
1600 hours of SD video per second: i.e. about 1-2 million concurrent HD streams, or much less than the amount actually served by YouTube.
IBM Summit "world's most powerful supercomputer": < 5000 nodes, i.e. much below the median cell size described in the 2015 Borg paper. Summit is a respectable computer but it would get lost in a corner of a FAANG datacenter.
Otherwise the tape alone to store it on would exceed their total operating budget in a day, so they have to be a bit clever about it.
not quite as big a difference
Anyway, GPDR requires you to have a purpose for any log that contains any IP address. Keeping logs for undefined purposes and unlimited time frames is not ok any more.
We log to SQL so that we can instantly obtain a full list of log entries that pertain to a specific user action trace id. We can go from a collection of user actions over a larger business process and then for each action we can pull all of the log entries that were generated. All of this is exposed through a nice web interface with full text search capability over the log messages. Without logging to SQL, we would have a hell of a time building something similar.
In order to keep this from exploding out of control, we have a strict 90 day expiration policy on all persisted business state and log entries. Our log table indicies are:
- User Action Trace Id (16 bytes)
- Timestamp (64 bits)
- Message (Variable - FTS)
We store all log entries in a single SQLite database (logs.db) contained in each environment. These are queried over HTTP from centralized management tools.
We usually end up querying based on session signatures (which is something like user agent + JWT), on plain user ids or on trace ids. This gets you a pretty neat timeline with references for full records if you need to drill down on a specific event. Makes everything super fast and when you need more data or want to read PII you can drill down (depending on your access roles).
There were always a bunch of bad decisions and a few good one in hindsight. I think it's completely fair to discuss solutions (including more recent tech like ELK) to the problem even more so if the problem still exists.
> Yes, SQL's a poor fit for logs, but it's a better fit then a lot of other things, including not logging at all.
No I disagree, not logging 40 TB of data to SQL would have been at least an equally good if not better choice, based on the input we have from the blog post. Just keep the past few weeks of raw traffic logs and be fine.
Well, that was my first reaction anyway. It seems to me that this is more of a 'when all you have is a hammer, everything looks like a nail' to me; if you are a vendor shop (i.e. Microsoft-only) and you are a DBA and specifically just for MSSQL, then this is probably the best way you could come up with, based on the skills, experience and tools at hand. It does make me wonder if at any point someone thought to ask for help instead of keeping this going.
Because as a database administrator, you often have to solve hosting problems without telling your developers to rewrite all their code.
The best you can hope for after a "hold the eff up" rewrite is for everything to keep working the same "but it's more hardened/scalable/modular/blah blah", and the worst you can hope for is to screw up some critical business process while the kinks are worked out.
Also, there is rarely any incentive for anyone from Joe Developer all the way up to C level to even call for this in the first place.
HAProxy logs syslog messages, in a configurable format and to a destination of your choosing. All you have to do is to setup your favorite logging agent between fluentd/logstash/syslog-ng to listen on syslog and forward messages to a SQL database.
My point was more about logging something straight in to a RDBMS wasn't much of a default at any point in time. Earlier on it was too expensive, and later on better systems came along. Some systems have built in text analysis and time series data specifics, but why would you worry about those when you have ElasticSearch and Kibana for most generic setups. Even before that you have syslog and a lot of log parsing libraries that worked great (well, unless you used windows event logs which has pros and a lot of cons).
1. Buy the hardware first with blazing drives and roaring CPUs and get it stood up before you start trying to change your schema on a potato
2. Do his last step now: set up an AG and let SQL Server do it's thing
3. Now go start planning your ETL/schema changes and writing scripts while you wait 1-3 days (or in my case 10 as we had to suspend during business hours) for data to migrate.
4. Fail-over and promote and start collecting data on your superbox
5. If the world didn't explode, you do the same thing at your other DC
6. Now, if you absolutely must, apply and test all your schema changes in production instead of dev first but enjoy the fact that those changes are exponentially faster!
You could also use log shipping, but with those restore times you'd have data loss I think. Haven't done that in years - I've been lucky enough to only have to deal with projects where we shutdown, restore, and do a full downtime migration. So much nicer...
Next, we get the cache to write to ElasticSearch and on lookup it would first check if memcache has a pointer to ElasticSearch already and if not it would spawn the job, copy the input set to ES, run the query there and add a pointer in memcache.
This meant that in practise you'd retain the same functionality you had before while not having a hard dependency on the backend.
Next, we scaled op the API service and the cache service to a very high number and started querying for time ranges that were known to not have pointers in memcache. Depending on the amount of spot instances we could get for the price we wanted and the saturation of the 10GbE DirectConnect we could do up to 400MiB/s, 29TiB took us about 3 days because the DirectConnect was also used to other traffic that had a higher priority and the back pressure system would reduce the API request count accordingly (using Istio).
In the end the users had only two changes:
1. No more 'talking to the database', you have to use the API which started out rather restriced
2. Some analysis takes a little longer during migration due to the SQL-ES copy action; after the MSSQL stuff was all gone the queries were much faster than before
When I've done this with 20TB DBs, it usually took several days to complete. That's much better than the 11 months mentioned in the blog post.
Quote: "We’re a very lean shop when it comes to hardware, and this is by far our largest chunk of data we have to store, so there wasn’t a place anywhere on the network to store a 33-38TB full backup temporarily, while it was being restored to a server."
As others have mentioned in the thread, a lot of the problems seem to have been solvable by just buying/attaching more storage, even if only temporarily.
Not sure which the data couldn’t just be compressed into daily summaries in the DB, while the raw logs are stored (Compressed) in the file system in case you ever had to go back.
Storing them raw can make sense. You might later come up with new ways to analyze the data which your aggregation logic of choice does not support.
"Our HAProxy1 logs aka traffic logs, are currently stored on two SQL Servers, one in New York, and one in Colorado. While we store a minimal summary of the traffic data, we have a lot of it. At the beginning of 2019, we had about 4.5 years of data, totaling about 38TB. "
and I asked myself, why the hell would you want this data?
Logs serve two purposes:
- find a fault and fix it
- audit what someone (or many someones) did
You never need years of data to find evidence of a fault: either it's available in the last few weeks or you didn't log the data in the first place (unless, of course, it's a bug that only shows up on leapdays, or leapseconds, or is similarly rare...)
So that leaves tracking your users in a way which isn't covered by your user's own history. Again, a month or two might be helpful, sometimes... but why years?
I'll give an example: a bug in the revision ordering logic that could be triggered by two edits being submitted almost simultaneously where one of the editors had made another edit within the past few minutes. The bug resulted in rows being interleaved between the two revisions and the post being rendered as a combination of two unrelated edits.
Almost impossible to reproduce. Affected a low double-digit number of posts out of millions edited over the course of several years.
Was able to finally nail down the scenario by pulling logs for all editors involved in each of the events over the years and identifying the common factor.
(You can argue this is a crazy way of storing revisions, but that's beside the point; at this scale even very unlikely bugs do show up with some frequency and it is nice to be able to investigate them)
Seems to me rather storing traffic logs in an SQL database is a very bad idea, especially if you have 40TB of it. Anybody can explain why SQL database makes sense for large time series data? Why not just dump it into linear files?
Let's say, just to say, that you have effectively free licensing for Microsoft SQL Server (because you already own the licenses in question), and you have staff who know that platform well.
Sometimes it's easier to use the tools you already have rather than go acquire a new platform to achieve a goal.
As with anytime you choose a persistence layer, you're accumulating technical debt the instant you make the choice. Over time, if better persistence layers become available, you have to make a judgment call about when it makes sense to change the code, versus when it makes sense to do maintenance work on the persistence layer.
People often ignore how important this is, especially for a small team like Stack Overflow appears to have. A new platform requires training and/or a lack of productivity as employees learn the new system. It requires developer time to convert over any work designed to use the old system. It makes hiring more difficult because now you will either want someone with the knowledge of both your systems which greatly shrinks down your pool of potential hires or new employees will need to learn a new system on the job which increases the ramp time up for any hires. Maybe this doesn't fit the textbook definition of technical debt, but it is still a real and ongoing cost that comes with making a decision to switch tools.
Providence: Failure Is Always An Option
Which has one of my favorite quotes about scope creep:
Kevin and I have essentially become allergic to big projects. We attempt to practice “What can get done by Friday?” driven development. Keeping things small precludes a whole class of errors like “We need a new datastore”, ‘cause that ain’t gettin’ done by Friday. It’s hard to sink a week on something you weren’t supposed to be working on when all you have is a week.
This doesn't sound like it cost $0
I guess because they want to query them, but in this case I would prefer the ELK stack (Elasticsearch + Logstash + Kibana).
You could do it on one machine with Pandas DataFrames or on multiple machines with Spark DataFrames. You could keep warm logs in standard log files and put everything else in Parquet. You could use a parallel database. Heck, you could use any of the cloud offerings to do it.
And you could use ElasticSearch. ElasticSearch would be fine for this.
Having worked with the system described in the past, it's quite flexible: data can be easily pulled out into more specialized systems for aggregation or trend analysis, while the raw logs remain quickly accessible over long enough periods of time to allow for digging into everything from support cases involving a single person to monitoring distributed attacks. Access can be controlled and monitored with reasonable granularity, and training new folks to use it is as easy as teaching some basic Select queries.
Not gonna suggest it's the most efficient system (if nothing else, SQLServer wastes a TON of disk space), but it minimizes complexity while deftly avoiding the choice between throwing away information when it's still needed and keeping everything in flat files which are too unwieldy to be used.
If you have to move that much data at once because your RDBMS doesn't work in any other way you're going to be in trouble, even if you get the 50k costing SSDs.
All of the things SQL server does can be done with not-SQL-server things, but you suddenly gain the capability to take shards offline and do multi-version migration, multi-host migration, multi-storage-backend migration. You can query in SQL, but also GraphQL. You can use Lucene search, and you can use ES-specific queries. You can have multiple write hosts, you can have tiered storage on application-level, OS-level and SAN-level and they can actually work together.
Again, it's not that SQL server doesn't do anything, it's just that it's probably not the best plan for time series access logs.
At this point, moving to something else would likely be a far more costly investment just in terms of requirements analysis than the migration described in the blog post. Would it pay off? Maybe! But for a critical bit of infrastructure, that's something you put an awful lot of careful thought and research into before you venture to do anything... And meanwhile, that infrastructure has to be kept running.
While most companies that are old enough to predate the iPhone usually have a lot of in-house technology that predates conventions and generic implementations, there is plenty of movement towards commonalty instead of keeping that special snowflake stuff in; at least at the places where I have/had influence.
When you freeze an in-house tool in time you essentially rob your own people of progress. The solution isn't rip-and-replace, but a process of continuous improvement is good for everyone.
Even Microsoft has a tool  that allows you to query IIS logfiles with SQL
I get this question asked a lot of times in the interview.
Why is it a bad idea to store logs in a SQL database? It has the ACID properties baked in and it is pretty reliable. What is the better alternative? I guess I have never worked on anything that would require keeping track of 40TB of data because of which the drawbacks aren't really obvious to me.
ACID is a red herring. These are traffic logs, not some precious global state that has to handle ACID transactions.
The better alternative is to dump the data into many big files sorted by time. When you need to do some SQL kungfu on a part of the data, store that part into SQL database.
If you need to do some kind of full table scan, what exactly is going to be faster about noSQL? With SQL, your data is presumably going to be non-contiguous so you are going to get slower seeks on disk...but (I believe, although I am not 100% familiar) this can happen with noSQL if you add an index (which are often BTree...so the same as SQL under the hood) and you have things like SSTables/compaction potentially slowing performance too (although maybe not an issue in this application...I don't know).
If you are doing some kind of full table scan though, there is no weird magic in noSQL that allows you to perform this faster afaik. You are reading from disk, that is your limit. The point, afaik, with noSQL for logging is that writes are faster so the throughput is larger...which can be required for this use case but probably isn't relevant for everyone.
The real architectural benefit we got from noSQL (remember - "Not Only SQL") was dynamic horizontal scalability. Once you separate Compute and Storage then you can make choices based on your specific workload.
> You are reading from disk, that is your limit.
Separating Compute and Storage works around this. Instead of putting all your logs in a monolithic SQL database, stream them into object storage. When you want to run a query, you can complete it arbitrarily fast by spinning up more compute nodes. Now you're using Massively Parallel Processing and that disk limit stops mattering as much. Lots of systems do this... AWS EMR, Snowflake, Presto, etc etc
So you get a much simpler storage system and faster queries.
You may appear to pay more for storage in this approach, but that's mostly a cloud vs on prem accounting question.
So given that not every single log is required, then you can step back from ACID requirements and find a solution that is easier to maintain. Taking 11 months to migrate a database is not reasonable for most projects.
Edit: added the link. https://nimbusdata.com/products/exadrive/pricing/
You're assuming the DBA did nothing else during the project's timeline, and that's not likely the case. A lot of work in projects like this is just letting the batch job run while you keep an eye on it.
In hindsight I would say that Elasticsearch is for full text queries, and if you are using it for something else (access logs) there is a good chance this is the wrong tool for the job.
To make the issue even more interesting, the system was mission-critical at five nines and the log data was sensitive + had a legal requirement to be kept online for at least one year, for queries. So deletion was not an option and the migration had to be done online.
Fortunately the system used Oracle DB and the log table was partitioned by date. Unfortunately, a partition kept three months of data, so it could reach the full amount of free space on the DB. There was no easy way to add storage to that server, but we did have some space on another one, with lower speed (the prod DB was RAID10, the other one was RAID1)
In the end we had to create a new partition scheme, with monthly partitions stored in different tablespaces, create another table with the same structure and switch it with the online one (a couple of seconds of downtime), backup the database and restore it on the slow storage, then delete the old table structure and tablespaces/datafiles to free the space on production.
A procedure was created to detach the oldest partition from the production database and reattach it to the slower database (where we also implemented compression at a later stage). This meant that the query system needed changes to work with both databases, but at least the system was up.
The good thing was that in Oracle you can detach partitions and tablespaces and transport them to a different database without downtime. I don't know MS SQL enough, but I understand from the migration procedure that this is not possible there. I guess they did that manual daily partitioning by table, but they did not store each table in its own datafile, so it can be moved as a whole (maybe with downtime, but at least it would free the space faster).
I would have tried a hybrid approach with the dynamic SQL one. Dynamic SQL sprocs like that don't seem to generate good query plans (at least not with a lot of prodding) in my experience, but if the dynamically generated SQL was done in the C# or PowerShell side rather than a sproc I feel the query analyzer would be less confused about individual queries coming out of the scripts.
The only other advice experience from a past life I have to offer is that I often saw better performance from T-SQL MERGE than T-SQL INSERT FROM. Dynamically generating MERGE statements from C# is not fun, but in that past experience it would give the best results for single-server data migrations like this. (Multi-server there's not much you can do beyond Bulk Copy.)
Then nobody has to waste 11 months of their life trying to retain and migrate TB of old logs that nobody needs, with no storage to do so and no backup.
I really wana know if they even compress the log data in there database.
40 TB of logs in a postgresql, like that can't be efficient?
I would probably just have a scaling blob store and would store it away. That has to be much faster, simpler and cheaper.
It is completely unmanageable - I inherited it. We’re moving data out, but slowly. This DB houses PII for tens of millions of users.
Maybe i'm missing it, but there doesn't seem to be any discussion of the result. _How_ much faster was it after the change? Was it 11-months-worth-of-dev-time faster?
Sometimes you have to stop writing after 9,000 words. ;-)
The second trick is to split the data in manageable chunks (a day of logs for example), then you can transfer one unit at a time to the new system.
Dump a chunk on one side, load on the other side, verify it's good. It will still take a while but you have a good indication of progress and don't need to start all over if one chunk fails.
All the data is duplicated in both DCs and the source of data still has a copy of the data.
I imagine is not a problem if it goes down for some moments. Developers may be unable to review traffic logs, that's it.
As Taryn stated repeatedly in the post, they simply didn't have enough infrastructure to have development spare copies of this. They didn't even have enough infrastructure to take a backup.
Given the situation, investing time into making do probably seemed like a much better bet than hoping for a windfall with which to buy equipment.