The admins at my work have always told us (i.e. people wanting to run data analysis) to avoid executing complex queries on production servers.
The workflow I use is to run simple select queries on prod databases to bring the subset of data I'm interested in down locally. Once I have the "extracts" I'm interested in, I'll transform the data locally using SAS.
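To make that concrete, a typical extract is something like this (table and column names made up for illustration); the idea is a narrow column list plus a date filter so the prod server does as little work as possible:

    SELECT order_id, customer_id, order_date, total_amount
    FROM orders
    WHERE order_date >= '2015-01-01'
      AND order_date <  '2015-04-01';

All the heavier joins and aggregation then happen locally on the extract.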
SAS has pretty powerful time interval routines, which is what I'd typically use to compute something like the example in the article.
My work has just started looking at "big data"; we have a new Hadoop database which is supposed to be used for that, so I assume if I ever need to start looking into data stored there it will make more sense to run the SQL "on the database". So the article has some useful info there.
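If I do end up querying the Hadoop side directly, my understanding is Hive lets you push that sort of interval calculation into the SQL itself. A rough sketch (made-up table and columns, and I haven't checked the exact behaviour against our Hive version):

    SELECT customer_id,
           DATEDIFF(ship_date, order_date) AS days_to_ship
    FROM orders
    WHERE order_date >= '2015-01-01';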
One thing to note here might be that if the network is the bottleneck, you'd likely be asked to do the opposite. In my experience, people who pay attention to indexes/partitioning can win DBAs' hearts, but DBAs often don't explain these things because most people don't listen. So to them it's easier to say "don't run complex queries".
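As a made-up illustration: if a big table is partitioned or indexed on a date column, filtering directly on that column lets the engine skip most of the data, while wrapping the column in a function usually forces a full scan.

    -- can prune partitions / use the index (assuming order_date is the key)
    SELECT customer_id, total_amount
    FROM orders
    WHERE order_date >= '2015-01-01' AND order_date < '2015-02-01';

    -- typically defeats the index and scans everything
    SELECT customer_id, total_amount
    FROM orders
    WHERE YEAR(order_date) = 2015;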
Part of the problem with getting people to pay attention to indexes/partitioning is that there's no easy way to surface that information to the end user. A person knowledgeable in SQL can query the right system tables to find it, but at that point you're pretty much thinking like a DBA in the first place. If the major vendors built a nice visualization of this into tools like SSMS/SQL Developer/PGAdmin3 etc., it might be easier (these might exist, I'm just not aware of them).
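For example, in Postgres the information is there if you know where to look, it's just buried in catalog views rather than surfaced in the UI (table name made up):

    SELECT indexname, indexdef
    FROM pg_indexes
    WHERE tablename = 'orders';

Which is sort of the point: you have to already know pg_indexes exists to ask the question.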
So yeah, the end result is DBAs give the generic advice: "don't run SELECT * and use a WHERE clause when you're querying large tables".
Hopefully that is a dump of the data as well. At least with Hadoop you have a little more control over resources with the right configuration. When we run Hive on our main HDFS cluster without limiting mappers and reducers, it will happily bog down the entire cluster to give you what you need; limiting the mappers, while lowering your throughput, eliminates the risk of noncritical queries hogging resources. Even then, a copy of the data with dedicated resources is still preferred, because without a lot of mappers and reducers your Hive queries are going to slow to a crawl.
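For what it's worth, the knobs we use look roughly like this, as session-level SET statements in Hive (the exact property names shift between Hadoop/Hive versions, so treat this as a sketch rather than copy-paste config):

    SET mapreduce.job.queuename=adhoc;                            -- route to a capped scheduler queue
    SET hive.exec.reducers.max=32;                                -- cap reducers for the session
    SET mapreduce.input.fileinputformat.split.minsize=536870912;  -- bigger splits -> fewer mappers

    -- queries run later in the same session pick up these limits

The queue name "adhoc" is just whatever capped queue your scheduler defines for noncritical work.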
Solar_Fields and Bigger_Cheese, my apologies for not specifying the db. In this case I'm talking about a production-level data warehouse specifically designed to perform complex queries, not a prod app db. Yes, I agree: best practice is to separate the two. Glad you found the article interesting.