OMG, the author just described the last place I was at. We processed a few TB of data, and suddenly there's this Rube Goldbergesque system of MongoDB getting transformed into Postgres... oh wait, I need Cassandra on my resume, so Mongo sux0r now... all displayed via some of the worst bowl-of-spaghetti Android code I've witnessed. The technical debt hole was dug so deep you could hide an Abrams tank in it. To this day I could not tell you what confused thinking led them to believe this was necessary rather than just slapping it all into Postgres and calling it a day.
All because they were processing data sets sooooo huge that they would fit on my laptop.
I quit reading about the time the article turned into a pitch for Stitch Fix, but up to that point it made a good case for what happens when companies think they have "big data" when they really don't. In summary, either a company hires skills it doesn't really need and the hires end up bored, or it hires mediocre people who make the kind of convoluted mess I worked with.
I really hate pat-myself-on-the-back stories, but I'm really proud of this moment, so I'm gonna share. One time a principal engineer came to me with a data analysis request and told me the data would be available to me soon, only to come back an hour later with the bad news that the data was 2 terabytes and I'd probably have to spin up an EMR cluster. I borrowed a spinning-disk USB drive, loaded all the data into a SQLite database, and had his analysis done before he could even set up a cluster with Spark. The proud moment came when he told his boss that we already had the analysis done, despite his warning that it might take a few days because "big data". It was then that I got to tell him about this phenomenal new technology called SQLite, and he set up a seminar where I got to teach big data engineers how to use it :)
P.S. If you do this sort of large-dataset analysis in SQLite, upgrade to the latest version with every release, even if it means you have to `make; make install;`. Seemingly every new release since about 3.8.0 has given me usable new features and noticeable query optimizations that matter for large analytical queries.
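For the curious, the approach was roughly this shape. A minimal sketch only, assuming a CSV export; the table, columns, and query are made up (the real schema was obviously different):

    # Load a CSV into a SQLite file on the external drive, index it, and run
    # an aggregate query. File name, table, and query are hypothetical.
    import csv, sqlite3

    conn = sqlite3.connect("/mnt/usb/analysis.db")
    conn.execute("PRAGMA journal_mode = OFF")   # safe enough: we can always re-import
    conn.execute("PRAGMA synchronous = OFF")
    conn.execute("CREATE TABLE events (user_id INTEGER, ts TEXT, bytes INTEGER)")

    with open("events.csv", newline="") as f:
        reader = csv.reader(f)
        next(reader)                            # skip the header row
        conn.executemany("INSERT INTO events VALUES (?, ?, ?)", reader)
    conn.commit()

    conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
    for row in conn.execute(
        "SELECT user_id, SUM(bytes) FROM events GROUP BY user_id ORDER BY 2 DESC LIMIT 10"
    ):
        print(row)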
"I guarantee that somewhere, sometime, an engineer has been like 'hay guys, I loaded our big data into SQLite on my laptop and it ended up being faster than our fancy cluster'". We then joked that the engineer would be fired a few weeks later for not being a "cultural fit".
A few minutes later you commented with your story. I hope you didn't get fired? :)
Here is an example for SQLite: https://news.ycombinator.com/item?id=9359568
I quite enjoyed it as well.
(I'm getting better at it!)
EDIT: And my desktop machine has -- let's see -- about 500000 times the RAM my first computer had. Truly astounding if you think about it.
On a somewhat related note: The original Another World would probably fit into the caches that your CPU has as a matter of course these days.
Was it 2 TB before compression? SQLite usually blows up the data size relative to the raw data (depending on the original format, obviously). It can be kind of wasteful, and I ended up storing some fields as compressed JSON for this reason (which actually beats the SQLite format).
Also, SQLite's insert speed can be much slower than the disk's sequential write speed (even if you make sure you're not committing/flushing on every row, have no indices to update, etc.).
So I think inserting and loading the data could be nontrivial. But the queries should be fast as long as they are indexed. In theory, SQLite queries should be slower for a lot of use cases because it is row-oriented, but in practice distributed systems usually add 10x overhead of their own anyway...
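The compressed-JSON trick looks roughly like this; a minimal sketch with made-up table and field names:

    # Store a wide/sparse record as one zlib-compressed JSON blob instead of
    # many columns, to keep the SQLite file size down. Names are hypothetical.
    import json, sqlite3, zlib

    conn = sqlite3.connect("compact.db")
    conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload BLOB)")

    record = {"name": "example", "tags": ["a", "b"], "metrics": [1.5, 2.25, 3.0]}
    blob = zlib.compress(json.dumps(record).encode("utf-8"))
    conn.execute("INSERT INTO records (id, payload) VALUES (?, ?)", (1, blob))
    conn.commit()

    # Reading it back: decompress and parse on the way out.
    (payload,) = conn.execute("SELECT payload FROM records WHERE id = 1").fetchone()
    print(json.loads(zlib.decompress(payload)))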
The productivity factor of Ruby or Python in a single memory space vs. Hadoop is at least 10x.
Or are you talking about just rolling your own format to dump your data into? I've done that, but I'd still appreciate it if I could use something that someone else had put the thought into. (Something like "cdb" by Daniel J. Bernstein, but that's an old 32-bit library that's limited to 4 GB files.)
Since this is Amazon, I suspect he is referring to SPICE (or their internal version), which was released last fall as part of AWS's QuickSight BI offering...
"SPICE: One of the key ingredients that make QuickSight so powerful is the Super-fast, Parallel, In-memory Calculation Engine (SPICE). SPICE is a new technology built from the ground up by the same team that has also built technologies such as DynamoDB, Amazon Redshift, and Amazon Aurora. SPICE enables QuickSight to scale to many terabytes of analytical data and deliver response time for most visualization queries in milliseconds. When you point QuickSight to a data source, data is automatically ingested into SPICE for optimal analytical query performance. SPICE uses a combination of columnar storage, in-memory technologies enabled through the latest hardware innovations, machine code generation, and data compression to allow users to run interactive queries on large datasets and get rapid responses."
As far as architecture is concerned, every team is different. At least on the retail side, we tend to center around a large data warehouse that is currently transitioning from Oracle to Redshift, with daily log-parsing ETL jobs from most services populating tables in that warehouse. Downstream of the data warehouse is fair game for pretty much anything.
* SQL Server is proprietary
* SQL Server licenses are expensive
* SQL Server runs only on Windows (or at least used to?)
Customer: "How much does an Oracle database license cost?"
Oracle Rep: "Well, how much do you have?"
To your own points:
* SQL Server is as proprietary as Oracle
* SQL Server is cheaper than Oracle
* SQL Server is being ported to Linux in 2017 :)
Where do I learn how to do this? I've tried loading a TiB (one table, one index) into SQLite on disk before, and it took forever. Granted, this was a couple years ago, but I must be doing something fundamentally wrong.
I want to try this out. I've got 6TiB here of uncompressed CSV, 32 GiB ram. Is this something I could start tonight and complete a few queries before bed?
Actually, out of curiosity, I looked it up on the SQLite site. If I'm reading the docs correctly, with atomic sync turned off I should expect 50,000 inserts per second. So, with my data set of 50B rows, I should expect to have it all loaded in ... just under 12 days. What am I missing?
There are a handful of things that make a difference. First of all, don't use INSERTs; use the .import command. This alone is enough to saturate all the available write bandwidth on a 7200 RPM drive. It is not transactional, so you don't have to worry about that... it bypasses the query engine entirely and is really more like a shell command that marshals data directly into the table's on-disk representation. You can also disable journaling and increase page sizes for a tiny boost.
Once the data is in SQLite you get the benefit of a binary representation, which (for my use case) really cut down the dataset size for the read queries. I only had a single join, and it was against a dimension table that fit in memory, so indexes were small and took insignificant time to build. A single table scan with some aggregation, and that was it.
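If it helps, here's roughly what that looks like when driven from a script. A sketch only: the database, table, and CSV names are made up, it assumes the `sqlite3` shell is on your PATH, and it assumes a headerless CSV whose columns match the table:

    # Drive the sqlite3 command-line shell so we can use its .import command
    # for bulk loading (much faster than row-by-row INSERTs from a script).
    import subprocess

    script = """
    PRAGMA journal_mode = OFF;
    PRAGMA synchronous = OFF;
    CREATE TABLE IF NOT EXISTS events (user_id INTEGER, ts TEXT, bytes INTEGER);
    .mode csv
    .import events.csv events
    CREATE INDEX IF NOT EXISTS idx_events_user ON events (user_id);
    """
    subprocess.run(["sqlite3", "big.db"], input=script, text=True, check=True)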
question: is SQLite incrementally helpful when I'm already comfortable with a local pgsql db to handle the use case you suggested? would SQLite be redundant for me in this case?
question: between postgres and unix tools (sed, awk) is there reason to use SQLite?
Reasons to prefer SQLite: it's easier to embed in an app, and it gives you sort, join, split-apply-combine, scale, transactions, compression, etc.
Reasons to prefer pgsql: SQLite's perf tools suck compared to pgsql's (last time I got stuck, anyway), and I'm sure there are lots of SQL-isms that SQLite doesn't handle, if that's your jam. EDIT: forgot SQLite's everything-is-a-string feel, just wanted to add that it has bitten me before.
You'll find PostgreSQL comes in handy once you get beyond the initial import stage.
PostgreSQL's type system will come to your aid. SQLite barely enforces column types (values are dynamically typed), which can turn nasty when you get serious with your queries.
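To make the typing point concrete, here's a tiny sketch; SQLite accepts the second insert without complaint, whereas Postgres would reject it with a type error:

    # SQLite columns don't enforce their declared types (type affinity only),
    # so a text value slips into an INTEGER column without an error.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (n INTEGER)")
    conn.execute("INSERT INTO t VALUES (42)")
    conn.execute("INSERT INTO t VALUES ('not a number')")   # accepted anyway
    print(conn.execute("SELECT n, typeof(n) FROM t").fetchall())
    # -> [(42, 'integer'), ('not a number', 'text')]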
I think part of the problem is that many engineers can debug an application, but surprisingly few learn performance optimization and how to find bottlenecks. This leads to an ignorance-is-bliss mindset where engineers assume they're doing things in reasonably performant ways, so the next step must be to scale, without even a simple throughput estimate. It turns into bad software architecture and code debt that causes high maintenance costs.
The universal skill is learning to run well-designed experiments, and that comes from practice. It's like debugging code without a debugger. There are profiling tools in some contexts that help you run these experiments, but at the highest level, simply dividing the number of bytes that move through a component by the time it took is very enlightening.
It's also valuable to have some rough familiarity with the limits of computer architecture. You can learn this experimentally; for example, you could test disk performance by timing how long it takes to copy a file much larger than RAM. You could copy from /dev/zero to /dev/null to get a lower bound on RAM bandwidth. You can use netcat to see network throughput.
Bandwidth is only part of the picture; in some cases latency is important (such as servicing many tiny requests). Rough latency numbers are in [0, 1], but can also be learned experimentally.
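As a back-of-the-envelope example of the bytes-divided-by-time measurement (the file path is made up; use a file much larger than RAM so the page cache doesn't flatter the number):

    # Crude sequential-read throughput estimate: bytes moved / elapsed time.
    import time

    path = "/data/big_file.bin"      # hypothetical path
    chunk = 1 << 20                  # read in 1 MiB chunks
    total = 0
    start = time.monotonic()
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk)
            if not buf:
                break
            total += len(buf)
    elapsed = time.monotonic() - start
    print(f"{total / elapsed / 1e6:.1f} MB/s over {total} bytes")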
Many popular primitives actually don't perform that great per node. For example, a single MySQL or Spark node might not move more than ~10 MB/s, which is significantly lower than network bandwidth. You can actually use S3 to move data faster if the access pattern is sequential :)
Now, a data warehouse where a horde of Hadoop engineers slaves over 1000-node clusters is probably overkill unless you're a company like Facebook or are in the data-processing biz. However, some database management systems can be very complementary.
For example, while you can do a majority of things with Postgres, some are not fun or easy to do.
Have you ever set up PostGIS? I'd much rather make a geo index in MongoDB, do the geo queries I need, and call it a day than spend a day setting up PostGIS.
We find that MongoDB is great for providing the data our application needs at runtime. That said, it's not as great when it comes time for us to analyze it. We want SQL and all the business-intelligence adapters that come with it.
Yeah, you could do a join with the aggregation framework in MongoDB 3.2, but it just isn't as fun.
I think you should try it again. I felt the same way back in the 1.5 days, but post 2.0 it is really easy, and the documentation has gotten so much better. It really is as simple as `sudo apt-get install postgis` and then `CREATE EXTENSION postgis;` in psql.
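For what it's worth, the whole thing end to end is roughly this; a sketch with a made-up database, table, and coordinates, assuming psycopg2 and a running Postgres with the PostGIS package installed:

    # Enable PostGIS, build a geo index, and ask "what's within 1 km of here?"
    # Database name, table, and coordinates are hypothetical.
    import psycopg2

    conn = psycopg2.connect("dbname=geodemo")
    cur = conn.cursor()
    cur.execute("CREATE EXTENSION IF NOT EXISTS postgis")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS places (
            id   serial PRIMARY KEY,
            name text,
            geom geography(Point, 4326)
        )
    """)
    cur.execute("CREATE INDEX IF NOT EXISTS places_geom_idx ON places USING GIST (geom)")
    cur.execute(
        "INSERT INTO places (name, geom) VALUES (%s, ST_MakePoint(%s, %s)::geography)",
        ("Ferry Building", -122.3937, 37.7955),
    )
    cur.execute(
        """SELECT name FROM places
           WHERE ST_DWithin(geom, ST_MakePoint(%s, %s)::geography, 1000)""",
        (-122.40, 37.79),
    )
    print(cur.fetchall())
    conn.commit()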
Moving data from A to B and applying some transformations on the way through seems like a straightforward engineering task. However, creating a system that is fault-tolerant, handles data source changes, surfaces errors in a meaningful way, requires little maintenance, etc. is hard. Getting to a level of abstraction where data scientists can build on top of it in a way that doesn't require development skills is harder.
I don't think most data engineers are mediocre or find their job boring. The expectation from management that ETL doesn't require significant effort is unrealistic, and it leads to a technology gap between developers and scientists that tends to be filled with ad-hoc scripting and poor processes.
Disclosure: I'm the founder of Etleap, where we're creating tools to make ETL better for data teams.
I do love reading an article that supports my contention that ETL is the Charlie Work [0] of software engineering.
0 - http://www.avclub.com/tvclub/its-always-sunny-philadelphia-c...
Though I've never formally done work with the "ETL" label, what I've seen of it reminds me of the work I used to do for a client years ago, where I'd take some CSV file (or what have you) and turn it into something their FoxBase-based accounting system could use. It was boring grunt work (or shall we say, "Charlie work"), but I billed it at my usual rate and it paid the mortgage. I would never, ever wish to make a career of it, however. (And if my assessment of what someone knee-deep in ETL does all day is completely off base, I apologize.)
BTW, that may sound polemical, but it's not -- that's really how a lot of business types, academic researchers, and others think of nearly all programming-related work.
Anyway, this is a shame. I thought "data science" still had nice R&D flavored jobs. Colour me disillusioned.
Well, yes, this is an embarrassing fact for data scientists. A midrange stock MacBook can easily handle a database of everyone on Earth. In RAM. While you play a game without any dropped frames.
With all the hype around 'big data' and all that crap, many people seem to forget how far you can go with plain, simple SQL when it's properly configured. I'm not talking about complicated optimizations, just solid fundamentals. And if you can't do it yourself, no problem: things like Amazon RDS will help you.
Time and again, the tools prove that it's not.
From my recollection, I can count the number of publicly known companies dealing with these datasets on two hands, if I'm being generous.
A 4 TB drive costs about $120, and you'll spend way more than that on software development and extra computers if you do distributed computing when you don't need to.