At Loopt, long before maps were included on mobile platforms, we served our own map tiles. These 64x64 pixel PNGs were often less than 100 bytes. Even though we only covered the US at the time, there were upwards of 20 million of these tiny files.
They caused file system destruction. We tried ext3, reiserfs, ufs, ntfs - they all choked. Even in the best case each file took 512 bytes on disk, 5x larger than the data within. Performance was awful no matter how we bucketed the files into directories or tuned the file systems. Reiser 4 trunk was looking promising, but then Hans was arrested.
The solution was to store the images in a database. They fit in-row without needing blob storage, and the DB indexes were much faster than any of the file systems we tried. Backups took less space and less time.
 He was eventually convicted of killing his wife: http://en.wikipedia.org/wiki/Hans_Reiser
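For the curious, a minimal sketch of that kind of tile table, using SQLite for illustration (the schema here is invented; Loopt's actual setup isn't described above):

    import sqlite3

    conn = sqlite3.connect("tiles.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS tiles (
            zoom INTEGER NOT NULL,
            x    INTEGER NOT NULL,
            y    INTEGER NOT NULL,
            png  BLOB    NOT NULL,  -- a ~100-byte tile fits in-row
            PRIMARY KEY (zoom, x, y)
        )""")

    def put_tile(zoom, x, y, png_bytes):
        # One index insert instead of a directory entry per tile
        with conn:
            conn.execute("INSERT OR REPLACE INTO tiles VALUES (?, ?, ?, ?)",
                         (zoom, x, y, png_bytes))

    def get_tile(zoom, x, y):
        row = conn.execute("SELECT png FROM tiles WHERE zoom=? AND x=? AND y=?",
                           (zoom, x, y)).fetchone()
        return row[0] if row else None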
For very small files (such as your <100 byte PNGs) it probably wouldn't be worth it, but for larger blobs it could be used to avoid the 'millions of files' problem.
I don't know if it would have worked, but I thought it was a cute idea.
Something like Redis might've been sweet, too, but it would require you to hold all that data in RAM.
This is patently false; reads/writes on a DB are often much faster, which is part of why we use DB systems.
Even image data, especially on write, can be faster going to a DB than to flat files, because the DB turns the writes into sequential IO. Small files are slow to write on filesystems, which is one reason we have DBs.
In most cases with a high-read/low-write load it's certainly a better idea to put thumbnails on a CDN for a variety of reasons, but to think that writing small chunks of data to a file system is always faster than a DB is to misunderstand what DBs do, and when and why they are useful.
Not really. We use DBs because file systems aren't ACID and we like to not lose data, we like to be able to query data, and we like multiple machines having access to that data. But the file system is generally much faster, especially when appending something like a log (yeah, DBs do that too, but that's beside the point).
> but to think that writing small chunks of data to a file system is always faster than a DB is to misunderstand what DBs do and when and why they are useful.
I don't think so. Perhaps you can expand on that.
Write a program that creates 100,000 1k-8k byte files inside a zip file with no compression, and another that writes the same files directly to the filesystem.
Run multiple copies of these programs, see which writes files faster.
Keep running the program(s) until the drive is 90% full. (Put the files in different directories if you feel this is fair.)
Run the programs again.
You'll notice that on a decent 7200 RPM drive the ZIP code writes about 40-50MB/sec.
You'll probably be lucky to get 4-5MB/sec from the file code (unless you're on an SSD, and even then it will probably be half the ZIP rate).
If you feel like it, also create a database and do the same with a file table like "CREATE TABLE data (id int NOT NULL PRIMARY KEY, data varbinary(8000))".
Insert 1000 to 2000 rows between commits.
The DB should also outperform the file code, but probably not the ZIP, because the DB has to write the data twice (once to the log and once to the data pages). A rough sketch of this experiment is below.
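Here's a rough sketch of the experiment in Python, using SQLite as a stand-in for the server DB and varbinary table described above; absolute numbers will differ from a real server, but the IO patterns are the point:

    import os, sqlite3, time, zipfile

    N = 100_000              # number of blobs
    SIZE = 4096              # ~4 KB each, inside the 1k-8k range above
    payload = os.urandom(SIZE)

    def bench(label, fn):
        start = time.time()
        fn()
        mb = N * SIZE / 1e6
        print(f"{label}: {mb / (time.time() - start):.1f} MB/sec")

    def write_zip():
        # One big file written sequentially; ZIP_STORED = no compression
        with zipfile.ZipFile("blobs.zip", "w", zipfile.ZIP_STORED) as z:
            for i in range(N):
                z.writestr(f"tile_{i}.png", payload)

    def write_files():
        # One file per blob: per-file metadata updates, scattered allocations
        os.makedirs("blobs", exist_ok=True)
        for i in range(N):
            with open(f"blobs/tile_{i}.png", "wb") as f:
                f.write(payload)

    def write_db():
        # Batched commits, 2000 rows apiece, per the suggestion above
        conn = sqlite3.connect("blobs.db")
        conn.execute("CREATE TABLE data (id INTEGER PRIMARY KEY, data BLOB)")
        for i in range(N):
            conn.execute("INSERT INTO data VALUES (?, ?)", (i, payload))
            if i % 2000 == 1999:
                conn.commit()
        conn.commit()
        conn.close()

    bench("zip", write_zip)
    bench("files", write_files)
    bench("db", write_db)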
The reason is that writing files to disk quickly produces a random IO pattern as the FS tries to allocate files all over the disk. Also, every time you create a file you have to update a whole bunch of file system metadata; a modern file system is a B-tree plus a journal. Sounds awfully like a database, right? Except this database is optimized for large files, at least 32-64 kbytes and preferably in the hundreds of MB.
And my main point was, speed is not the reason we use databases, transactional semantics and query capabilities are.
edit: It's not that your point is wrong by any means, it's just that you're arguing against a point that he wasn't really emphasising. I don't think anyone, including him, would try to tell you that constant appending to a log file would be faster through a db than just a direct write to the file.
Link to paper: http://research.microsoft.com/apps/pubs/default.aspx?id=6452...
"Images, files, and binary data: Great now you can't use nginx or another lightweight web server to serve them up."
That's not true - you can create a private HTTP endpoint in your web app that your lightweight caching frontend can serve and cache from.
(I'm not saying it's not a better idea to store things in S3 or another big-blob service, but the reasoning in this article doesn't hold water)
In terms of efficiency, you're only paying a small price on the first hit for an image before the cache comes into play. It won't be as efficient for you as directly serving from S3, but with a properly configured cache, the effect would be negligible.
In terms of complexity, you're looking at a single nginx configuration endpoint of a handful of lines to cache the images on-disk or in memcache for fast serving. In the case where you've got CloudFlare in front of your web app, it's free. Overall it can be less complex, because you can re-use your existing code for talking to the database, rather than adding code to talk to another service like S3.
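For illustration, that handful of lines might look something like this (the paths, names, and upstream address are invented, not from the article):

    # in http {}: define a 10 GB on-disk cache for image responses
    proxy_cache_path /var/cache/nginx/images keys_zone=images:10m max_size=10g;

    # in server {}: serve /images/ from cache, hitting the app's private
    # endpoint only on a cache miss
    location /images/ {
        proxy_pass http://127.0.0.1:8000;
        proxy_cache images;
        proxy_cache_valid 200 7d;
    }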
The real argument against storing big binary data in the database should be based on the argument that databases aren't necessarily always tuned properly for storing things that aren't relational (which the article briefly addresses), and that storing your binary data alongside your relational data could have deleterious effects on the performance of your relational queries.
The recommendation for using FILESTREAM is that the individual units of BLOB data be 1 MB in size or higher.
Though when that is the case, the benefits look mighty impressive.
Now if you throw in having to do a little bit of post-processing based on various conditions, then the scales start to tip.
Of course, it depends on what you are doing, but many databases make working with BLOBs pretty straightforward and I wouldn't rule out the benefits just to follow some over-general rules.
The "size of database backups" point has been countered several times on this page effectively so I won't bother but in terms of scaling your app, restoring backups, managing versioned backups, moving your app around and a whole host of other management and maintenance related tasks, being able to completely blow away your cache of image files and have them rebuild on demand is tremendously useful.
Also, on the point of "storing thumbnails": this assumes that your thumbnail size requirements never change over time. Storing the source image in a database and generating cached files on demand means you can change your design and all you need to do is change the size parameters in your URLs and, bam, new thumbnail sizes.
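A minimal sketch of that pattern, assuming Flask and Pillow and an invented images table (put nginx or a CDN in front so each size is only generated once):

    import io, sqlite3
    from flask import Flask, abort, send_file
    from PIL import Image

    app = Flask(__name__)

    @app.route("/thumb/<int:image_id>/<int:w>x<int:h>")
    def thumb(image_id, w, h):
        # Pull the source image out of the DB (table/column names invented)
        conn = sqlite3.connect("images.db")
        row = conn.execute("SELECT data FROM images WHERE id = ?",
                           (image_id,)).fetchone()
        conn.close()
        if row is None:
            abort(404)
        # Resize to whatever size the URL asks for
        img = Image.open(io.BytesIO(row[0]))
        img.thumbnail((w, h))
        out = io.BytesIO()
        img.save(out, format="PNG")
        out.seek(0)
        return send_file(out, mimetype="image/png")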
One shouldn't forget that a filesystem is itself a DB, albeit one specifically engineered for the purpose of storing files. There's some overhead involved, because the filesystem stores a fair bit of metadata about each file. I don't think this argument always holds true.
There was a paper (a little dated now) from MS Research which looked into this a little: http://research.microsoft.com/pubs/64525/tr-2006-45.doc. The research is not really conclusive, but it suggests files smaller than 256kb might be better stored in the DB, whereas for anything larger the filesystem performs better. Would be nice to see some more research on the topic.
It really depends on your situation. Point by point:
>read/write to a DB is always slower than a filesystem
Obviously this is true, but depending on the context, the difference may not be that important.
>your DB backups grow to be huge and more time consuming
This just doesn't hold up to scrutiny. You still have to back the files up, regardless of where you put them. If you keep them in the DB, then backing up the DB automatically backs up the files. If you keep them separately, you're adding complexity to the backup/restore process.
I guess there are times when you keep references to files in your DB but can live with having the files and DB get out of sync if you have to restore from a backup, but that's a design decision, not a hard and fast rule.
>access to the files now requires going through your app and DB layers
Again, there are times when this simply doesn't matter.
Let's not forget that one of the main reasons people use databases is that they simplify concepts like locking. If you start tying your database to some external structure within your file system, you not only have to implement those features yourself, but you also have to coordinate them with the DB's own implementation.
This isn't true and has never been true. If you're writing a lot of records, especially small ones, a DB is often faster because the writes are more sequential and the database optimizes how pages are flushed to disk.
There's also SQL Server's FILESTREAM attribute, which is rather interesting. It tells the engine to store the data as files on the back-end. In theory that combines the best of both worlds, though I haven't had many good chances to poke at it yet.
If they are actual files you can do an incremental backup. If they are in the database, the backup is just one huge file that has to be backed up in full every time. Just dealing with the IO takes much longer, not to mention the disk storage and possibly network bandwidth.
If you need to restore a single image, you are going to need to create a whole database just to hold the backup before pulling out the one image; depending on how much data you have, this can take hours. But with a file it's trivial.
If you are also doing transaction log backups, restoring a file that was deleted is fairly trivial too.
This is simply not true; in fact, it is so not true that I must assume you have never actually used a database.
Don't be afraid of putting logs in databases and then figuring out how to scale it. It's not that hard and it's very worth it.
It's really more about using the right tool for the job: Splunk, Loggly, NoSQL solutions, etc. are a MUCH better fit than an RDBMS for this sort of work.
It uses custom index files stored on a filesystem.
It can also connect to MySQL and to HBase, but its own data is kept in its own index.
Screw referential integrity.
Screw db security.
Screw simple administration (single backup).
Raw performance of one part of the system is the only concern.
While it is "raw performance of one part of the system", it's also the hardest aspect of a system to scale properly. Not doing these three things can often get a user months or years further down the line without having to worry about it at all.
This gives machine readability but also makes for really simple tools for extracting whatever "traditional" log lines you like. (see: https://github.com/trentm/json) You can also easily add new fields without breaking existing log parsing code.
I believe I first read about logging in JSON here: http://journal.paul.querna.org/articles/2011/12/26/log-for-m...
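For illustration, a minimal sketch of that style using Python's stdlib logging (the extra field names are made up):

    import json, logging, sys, time

    class JsonFormatter(logging.Formatter):
        def format(self, record):
            entry = {
                "time": time.strftime("%Y-%m-%dT%H:%M:%SZ",
                                      time.gmtime(record.created)),
                "level": record.levelname,
                "msg": record.getMessage(),
            }
            # New fields ride along without breaking existing parsers
            entry.update(getattr(record, "fields", {}))
            return json.dumps(entry)

    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("app")
    log.addHandler(handler)
    log.setLevel(logging.INFO)

    log.info("user login", extra={"fields": {"user_id": 42, "ip": "10.0.0.1"}})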
Let's imagine the following scenario: your database crashes and you lose everything. There's a good chance you can't run your application any more (if you don't have backups), but you still have your images! Great! What do you do with them? Or you lose your files but not the database: cool, you still have the paths of all your lost files sitting in the database, so now what?!
The idea is to keep files (such as images) in the database so that you can pull them down onto the file system and back up only the database. Plus, the data isn't always physically stored in the database; for example, with MSSQL you can use FILESTREAM to store documents.
We did this for one project because we wanted transactional semantics when it came to image files and other related information, and this made it easy. We didn't serve images from the db - instead we created image files on the web servers from what was in the database.
EDIT: Of course you have to make sure you don't do any writes that block the main operation, e.g. by using message queues.
I feel like this article is making a large number of assumptions about the problem I am trying to solve.
I always thought it was the other way around. Databases can cache data and keep files open (keep the file locked or something, I don't know).
Also, my own benchmarks on an old laptop (MySQL versus FAT32) showed that MySQL was actually faster, even though you still had to connect (no persistent connection). I should try this again on a decent server and use NTFS or ext3 (see if that makes much difference), but in general I can't really see how a database can be slower, given that you use it for what a database should be used for. Of course if you read a movie from it once a day, it's not gonna cache that, and the file handle is not going to help much either. Then MySQL is only an obstruction to the data, which it has to read from the disk anyway. Hmm, or is that what he meant here?
1. Database pulls file off disk, or out of cache, reads into memory buffer
2. Memory buffer is copied to a new buffer in SQL client library (over a socket or whatever)
3. Memory buffer is copied to a new buffer in your scripting language
4. Which is then written to the web server's memory buffer, where it gets served to the client.
Which ends up being a whole lot of copying (and a whole lot of work even when the data is cached in the DB) compared to:
1. Web server mmap's file
2. Web server streams file out to client (a minimal sketch of this path is below)
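That second path is essentially what sendfile(2) buys the web server; a rough sketch in Python (Linux), assuming an already-connected socket:

    import os, socket

    def serve_file(conn: socket.socket, path: str) -> None:
        # os.sendfile copies from the page cache straight to the socket;
        # the file's bytes never pass through userspace buffers.
        fd = os.open(path, os.O_RDONLY)
        try:
            size = os.fstat(fd).st_size
            offset = 0
            while offset < size:
                offset += os.sendfile(conn.fileno(), fd, offset, size - offset)
        finally:
            os.close(fd)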
Is this situation no longer true? Are there newer tools that handle "serve a file from a BLOB" more efficiently? Do they also handle things like seeking (resumes) in a non-wasteful manner? Or is memory performance so good and the blobs we're talking about so small (bytes, not kilobytes) that it doesn't matter?
Of course, the author of the article also believes that every database is hooked to a web server that will be constantly pushing its data to hundreds of thousands/millions of users, that every 10 milliseconds lost means another 1% of your users abandon ship, and that if you claim to have some other circumstances, you are pleading "unique snowflake" status and should just shut up and follow his advice anyway.
We have Graylog in place to help query these logs from our syslog server, and it has helped us identify a lot of different issues since we put it in place. Not to be rude to the poster, but I don't think you have experience with really big log files and how painful it can be (or how long it can take) when you have to grep through this information by hand; zcat-ing an old compressed log file can take up to ten minutes before you even start seeing data, let alone relevant data.
You're already doing what he was suggesting anyway - storing your logs somewhere other than your application database.
The disk contention issue could be a problem because writes are hard to scale, but depending on the regularity of logging and the indexing used (I'm assuming it's done by timestamp here), it's probably not actually the biggest of deals: you should end up mostly doing append-only writes. In Oracle, for example, all that'll happen is that for each log transaction some stuff will get altered in memory (i.e. a block of the table will get appended to, and an append will occur on the index) and the redo log will get written to on disk. Every so often the table and index blocks will get written to disk, but that will be much less than once per transaction in an appending case.
That said, overall I'd still recommend having the logging done on a separate system, or at least having profiling set up on your log table/user/whatever such that logging can't eat your production server if someone decides to excessively ramp up the log levels.
Something like Bigtable is pretty good at storing small blobs. It's also pretty good at scaling, so you should never really have to delete "usage statistics, metrics, GPS locations, session data" or anything, really.
Regarding logs, I completely agree. Something like Tornado + Scribe + S3 works quite well (check out https://github.com/gregbayer/logcatcher). But then EMR + Scribe can be used to access it like a DB.
* Credit card numbers
* Social security numbers
* Plain text passwords
IMO Papertrail is a much better service with the exact same MO.
Never let this author near your GIS/spatial system.
I wouldn't recommend logging to the same database instance that's used to store customer data for both security and performance reasons, but that's no reason not to use a database for logging.
I agree; what it should say is: never log to the main database, create a DB especially for it (preferably on an entirely different file system/disk/machine).
And don't forget to set it to 'append only'
Indeed; MongoDB is quite often used for storing (structured) log data.
I was young and stupid, and I needed the money.
Why is it wrong? Transactions and referential integrity. If you can't guarantee transactions that update data in the database along with any other data, file-based or not, then you risk referential integrity issues. Even if you have a transactional infrastructure that you would bet your life on, something could still happen, like a partial restore that restores the file system but not the database.
It isn't that it is impossible to have a working transactional infrastructure without storing these things in the database, it is that it is much easier to store it in the database, and sometimes making things easy can save your ass.
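A minimal illustration of the "easy" path, using SQLite (with the image sitting on a filesystem instead, there's no way to make these two writes one atomic unit):

    import sqlite3

    conn = sqlite3.connect("app.db")
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS photos (id INTEGER PRIMARY KEY, data BLOB NOT NULL);
        CREATE TABLE IF NOT EXISTS captions (
            photo_id INTEGER NOT NULL REFERENCES photos(id),
            text     TEXT    NOT NULL
        );
    """)

    def add_photo(data: bytes, caption: str) -> int:
        # Both inserts commit together or roll back together: no orphaned
        # blob, no caption row pointing at a missing image.
        with conn:
            cur = conn.execute("INSERT INTO photos (data) VALUES (?)", (data,))
            conn.execute("INSERT INTO captions VALUES (?, ?)",
                         (cur.lastrowid, caption))
        return cur.lastrowid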
Bear in mind not storing files in a database does create restore issues, since database and file backups will have been created at different times, so there's always a risk of orphaned/missing files.
Map/reduce on log data is absolutely wonderful and opens new worlds of answers that you can only dream of. A Hive workflow is fantastic for parsing log data. You can answer any question, and even GRAPH any answer from your logs. The insight gained from capturing and processing logs can do wonders for your performance problems, ideas for new uses of your code, and so on.
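For instance, here's a toy version of that kind of job in Python over JSON-formatted log lines (field names invented for the example); Hive lets you express the same thing in SQL over far more data:

    import json, sys
    from collections import Counter

    # "Map": emit (path, latency) per request line; "reduce": aggregate per path
    hits = Counter()
    latency = Counter()
    for line in sys.stdin:
        entry = json.loads(line)
        hits[entry["path"]] += 1
        latency[entry["path"]] += entry["latency_ms"]

    for path, n in hits.most_common(10):
        print(f"{path}\t{n} hits\t{latency[path] / n:.1f} ms avg")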
I hate any approach to engineering that says "you'll never need it, so don't do it"; do it, and discover the great things that some things unlock for you. Burying your head in the sand on this point just limits your ability to unlock unique insight and gain an edge.
I particularly disagree with this because of the flourish at the end of article, where he says there are no exceptions to his advice. Flatly wrong, and if you tell me "don't do it, trust me," I lose a bit of respect for you as a communicator.
So putting a slice of time from a log into a database to generate some reports and then dumping it: I don't have a problem with that.
On top of that, it's a bad idea because, where are you going to put your database logs? In the database? What happens when you run out of disk? All your applications are going to be trying desperately to log that they're out of space, and your database won't have anywhere to put those messages. This is why /var/log is traditionally a separate filesystem in Unix. But if you log to the database, unless you go through some hoops (tablespaces) they're all going to the same partition and the same physical disk. Believe me, these scenarios happen in real life, and usually on a weekend. Even worse, some applications (though not Postgres, I think) will try and hold onto pending writes until there is free disk, so the moment you free up space, you lose it. You don't want to have to shutdown the production database just because your logs ate up a bunch of space.
The right thing to do is let your database have its own machine (yes, physical machine) and its own disks. Ideally, your Postgres install has separate physical disks for the data and the WAL, and I'd probably put the regular logs on another disk as well if I could afford it. You'll be shocked how much better it will perform, too. Don't fool yourself. A bunch of VPSes is not the same as real hardware running a real database like Postgres.
There's some more interesting discussion on large append-only tables here: http://archives.postgresql.org/pgsql-performance/2012-02/msg...
If you're sufficiently careful, I imagine you could find similar performance with PostgreSQL-based logging as you would with a rotating flatfile log. Except you wouldn't need to worry about consolidating and parsing them.
Here's how I see it:
Logging isn't ideal in traditional databases, because then you have to scale writes. Scaling reads is fairly easy in a traditional database, but scaling writes is much harder compared to something like HBase or Cassandra.
Image storage in a traditional database will always suffer compared to other options. Filesystems are faster and easier to use, and things like S3 match requirements better in some cases.
I'm not really seeing anyone argue that ephemeral data is better in a database than in something like Redis.
By comparison, people write about scaling MySQL to 25,000 writes per second.
It's very hard to get a single server to scale multiple orders of magnitude higher (even with FusionIO or something). Getting around that single master server for writes is possible, but either complicated or expensive. Even then I doubt you will get to 1 million writes per second without sharding, and if you are going to shard you might as well go to a NoSQL solution built for the problem.
I am very interested in the first article though. It mentions number of instances in an EC2 availability zone - what does this mean? Is this similar to a clustering/load balanced solutions in traditional RDBMSes?
> The automated tooling that Netflix has developed lets us quickly deploy large scale Cassandra clusters, in this case a few clicks on a web page and about an hour to go from nothing to a very large Cassandra cluster consisting of 288 medium sized instances, with 96 instances in each of three EC2 availability zones in the US-East region. Using an additional 60 instances as clients running the stress program we ran a workload of 1.1 million client writes per second.
Edit: Argh! Sorry, I just didn't read far enough before I posted. They seem to be powered by some fairly heavy-duty hardware. Is there a comparative study using Oracle RAC or something like this?