Hacker News

... a highly specialized team of dedicated engineers...If they are not bored, chances are they are pretty mediocre. Mediocre engineers really excel at building enormously over complicated, awful-to-work-with messes they call “solutions”.

OMG, the author just described the last place I was at. Processed a few TB of data and suddenly there's this Rube Goldberg-esque system of MongoDB getting transformed into Postgres...oh, wait, I need Cassandra on my resume, so Mongo sux0r now...displayed with some of the worst bowl of spaghetti Android code I've witnessed. The technical debt hole was dug so deep you could hide an Abrams tank in there. To this day I could not tell you the confused thinking that led them to believe this was necessary rather than just slapping it all into Postgres and calling it a day.

All because they were processing data sets sooooo huge, that they would fit on my laptop.

I quit reading about the time the article turned into a pitch for Stitch Fix, but leading up to that point it made a good case for what happens when companies think they have "big data" when they really don't. In summary, either a company hires skills it doesn't really need and the hires end up bored, or it hires mediocre people who make the kind of convoluted mess I worked with.

This is so true. I do business intelligence at Amazon, and I've seen this play out millions of times over. The fetishization of big data ends up meaning that everybody thinks their problem needs big data. After 4 years in a role where I am expected to use big data clusters regularly, I've really only needed it twice. To be fair, in a complex environment with multiple data sources (databases, flat files, excel docs, service logs), ETL can get really absurdly complicated. But that is still no excuse to introduce big data if your data isn't actually big.

I really hate pat-myself-on-the-back stories, but I'm really proud of this moment, so I'm gonna share. One time a principal engineer came to me with a data analysis request and told me that the data would be available to me soon, only to come to me an hour later with the bad news that the data was 2 terabytes and I'd probably have to spin up an EMR cluster. I borrowed a spinning disk USB drive, loaded all the data into a SQLite database, and had his analysis done before he could even set up a cluster with Spark. The proud moment comes when he tells his boss that we already had the analysis done despite his warning that it might take a few days because "big data". It was then that I got to tell him about this phenomenal new technology called SQLite and he set up a seminar where I got to teach big data engineers how to use it :)

P.S. If you do any of this sort of large dataset analysis in SQLite, upgrade to the latest version with every release, even if it means you have to `make; make install;` Seemingly every new release since about 3.8.0 has given me usable new features and noticeable query optimizations that are relevant for large query data analysis.
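For what it's worth, the core of that workflow fits in a few lines of Python's built-in sqlite3 module. A toy sketch (the table, columns, and data here are all made up for illustration; in the real story the source would be a multi-GB CSV on a USB drive and the database a file on disk):

```python
import csv, io, sqlite3

# Hypothetical stand-in for a large CSV file.
raw = io.StringIO("user,bytes\nalice,100\nbob,250\nalice,50\n")

conn = sqlite3.connect(":memory:")  # use a file path for real datasets
conn.execute("CREATE TABLE events (user TEXT, bytes INTEGER)")

reader = csv.reader(raw)
next(reader)  # skip the header row
with conn:  # load everything inside a single transaction
    conn.executemany("INSERT INTO events VALUES (?, ?)", reader)

# The "analysis": total bytes per user, largest first.
for user, total in conn.execute(
        "SELECT user, SUM(bytes) AS total FROM events "
        "GROUP BY user ORDER BY total DESC"):
    print(user, total)
```

The same pattern scales to the terabyte range as long as the queries are mostly scans and joins against small dimension tables, as in the story above.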

A coworker and I were laughing at the parent comment, and I told him:

"I guarantee that somewhere, sometime, an engineer has been like 'hay guys, I loaded our big data into SQLite on my laptop and it ended up being faster than our fancy cluster'". We then joked that the engineer would be fired a few weeks later for not being a "cultural fit".

A few minutes later you commented with your story. I hope you didn't get fired? :)

There was an HN story in which the author made the same "big data" claim (his intention was to show there is very little big data in the real world for most people out there) but only needed to process a couple of GBs. He just used standard Unix commands, and the performance turned out to be awesome.

Here is an example for SQLite: https://news.ycombinator.com/item?id=9359568

I think you're talking about this: http://aadrake.com/command-line-tools-can-be-235x-faster-tha...

I quite enjoyed it as well.

I used to work somewhere that did a lot of ETL and we used MS databases - mainly, ahem, Access. At the time I had no idea about *nix. I've often thought since how much easier and quicker it would be to solve lots of the processing jobs with this sort of thing, but never thought about implementation details. This is a great reminder that new is not always best and reminds me that a little knowledge is a dangerous thing!

That guy's problem set allowed him to filter out massive percentages of his data from the get go.

Wow thanks! I couldn't find it in search :-) glad you found it.

The really fun aspect of this for me, personally, is that I've been doing computers for ages (25 yrs or so?) now and 3TiB still intuitively feels like a massive amount of data even though I think I have something like 6TiB free space on my home server disks... which, in total, didn't even cost as much as a month's grocery shopping. Sometimes it really takes effort to rid yourself of these old intuitions that don't really work any more.

(I'm getting better at it!)

EDIT: And my desktop machine has -- let's see -- about 500000 times the RAM my first computer had. Truly astounding if you think about it.

I can remember going through this with a 10MB file about 15 years ago. It felt like a lot after growing up with floppy disks. But even a modest CPU could iterate over it quickly; I just didn't realise that, and assumed I would need to process it in a database!


On a somewhat related note: The original Another World[1] would probably fit into the caches that your CPU has as a matter of course these days.

[1] https://www.youtube.com/watch?v=Zgkf6wooDmw

My first hard drive was only 2MB larger than the caches on my CPU.

That's awesome... I'm working with some 15-20 GB sqlite databases. Though 2 TB sounds kind of big?

Was it 2 TB before compression? Because sqlite does blow up the data size over the raw data usually (depending on the original format obviously). It can be kind of wasteful, and I ended up storing some fields as compressed JSON for this reason (that actually beats the sqlite format).
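For concreteness, the compressed-JSON-in-a-BLOB trick mentioned above can look like this (a sketch; the schema and record are hypothetical):

```python
import json, sqlite3, zlib

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload BLOB)")

# Pack a bulky row as compressed JSON instead of many typed columns.
record = {"user": "alice", "events": list(range(100))}
blob = zlib.compress(json.dumps(record).encode("utf-8"))
conn.execute("INSERT INTO records (id, payload) VALUES (?, ?)", (1, blob))

# Reading it back: decompress, then parse.
(stored,) = conn.execute("SELECT payload FROM records WHERE id = 1").fetchone()
assert json.loads(zlib.decompress(stored)) == record
```

The trade-off is that fields inside the blob can no longer be indexed or filtered on directly in SQL.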

Also, the sqlite insert speed can be much slower than the disk's sequential write speed (even if you make sure you're not committing/flushing on every row, and if you have no indices to update, etc.)

So I think inserting and loading the data could be nontrivial. But the queries should be fast as long as they are indexed. In theory, sqlite queries should be slower for a lot of use cases because it is row-oriented, but in practice distributed systems usually add 10x overhead themselves anyway...
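On the insert-speed point: in my experience most of the gap comes from committing every row. A minimal sketch of the two patterns (in-memory here so it runs anywhere; on disk the slow pattern is far worse, since every commit waits on an fsync):

```python
import sqlite3

rows = [(i, i * 2) for i in range(10_000)]

# Slow pattern: autocommit mode, one implicit transaction per INSERT.
slow = sqlite3.connect(":memory:", isolation_level=None)
slow.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
for r in rows[:100]:  # on disk, every one of these pays a full commit
    slow.execute("INSERT INTO t VALUES (?, ?)", r)

# Fast pattern: one explicit transaction around a single executemany.
fast = sqlite3.connect(":memory:")
fast.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
with fast:  # commits once, at the end
    fast.executemany("INSERT INTO t VALUES (?, ?)", rows)
```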

I worked at a place that shall remain anonymous. An engineer later terminated for cause spent almost 2 months setting up a hadoop cluster with all the stuff that goes along with it to analyze some log data. When a senior engineer manager finally got incredibly tired of waiting for the results, said senior engineer manager wrote the desired analysis in an afternoon in ruby, and it ran in a couple hours mostly limited by the spindle speed on his laptop drive.

The productivity factor of ruby or python in a single memory space vs hadoop is at least 10x.

Fellow Amazonian here. We switched from a massively distributed datastore (not to be named) to rodb for storage and found a 10x improvement, not to mention eliminating cost and other headaches; kind of expected, since rodb is an embedded db...

I'd like to know what rodb is. Columnar DB? Columnar DBs often beat "big data" -I've beaten decent sized spark clusters in one thread of J.

Read only database. It is a hand optimized/compressed database engine that is used for big "once a day" data sets.

I'm Googling and finding a few things called "RODB" that don't quite match your descriptions. What in particular is it?

Or are you talking about just rolling your own format to dump your data into? I've done that, but I'd still appreciate if I could use something that someone else had put the thought into. (Something like "cdb" by Daniel J. Bernstein, but that's an old 32-bit library that's limited to 4 GB files.)

rodb is in-house, non-FOSS tech @ Amazon.

Thanks, that actually helps, particularly since I sell a competing product (which isn't RO).

What is rodb?

I'm guessing he means a "Real-time Operational Database". This seems to be a generic term for a system, like a data warehouse, that contains current data instead of just historical data. If you are taking the output of a Spark flow and storing it in Postgres or MongoDB or HBase for applications to query, then those could be considered RODBs.

Since this is Amazon, I suspect he is referring to SPICE (or their internal version), which was released last fall as part of AWS's QuickSight BI offering...

"SPICE: One of the key ingredients that make QuickSight so powerful is the Super-fast, Parallel, In-memory Calculation Engine (SPICE). SPICE is a new technology built from the ground up by the same team that has also built technologies such as DynamoDB, Amazon Redshift, and Amazon Aurora. SPICE enables QuickSight to scale to many terabytes of analytical data and deliver response time for most visualization queries in milliseconds. When you point QuickSight to a data source, data is automatically ingested into SPICE for optimal analytical query performance. SPICE uses a combination of columnar storage, in-memory technologies enabled through the latest hardware innovations, machine code generation, and data compression to allow users to run interactive queries on large datasets and get rapid responses."


Read-only DB. Think of it as read-only memory mapped key/value store.

Like rocksdb or lmdb but optimized for reads. Read-only DB.

What technologies/architecture does Amazon use for business intelligence? I've just done a business intelligence course, so I'm interested in how more "technology-centered" companies approach BI, and whether it's the same thing I learned in the course (put everything into an integrated relational database, if I simplify a lot).

Lots and lots of SQL (a very good thing IMO). This can come in the form of Oracle, Redshift, or any of the commonly available RDS databases (probably not SQL Server though). This is augmented with a lot of Big Data stuff, which used to be pretty diverse (Pig, Hive, raw Hadoop, etc.) but is moving very quickly towards a Spark-centric platform. There is occasionally some commercial software like Tableau/Microstrategy. Apart from that, it is a whole lot of homegrown technology.

As far as architecture is concerned, every team is different. At least on the retail side, we tend to center around a large data warehouse that is currently transitioning from oracle to redshift, and with daily log parsing ETL jobs from most services to populate tables in that data warehouse. Downstream of the data warehouse is fair game for pretty much anything.

Are you using Re:dash[1]? Seems like a good fit if you're doing lots of SQL and working with many different databases.

[1] http://redash.io/

Do you have some sort of framework for those ETL jobs? How do you handle ETL dependencies/workflow? (Thinking of Spotify's Luigi[1] here.)

[1] https://github.com/spotify/luigi

Would you know why sql server generally isn't used?

Probably because:

* SQL Server is proprietary

* SQL Server licenses are expensive

* SQL Server runs only on Windows (or at least used to?)

I chuckled a bit because the GP mentioned Oracle. I'm sure you've heard this one:

Customer: "How much does an Oracle database license cost?"

Oracle Rep: "Well, how much do you have?"

That's actually not a joke. Oracle liked to argue you should accept server licences as being a fraction of your budget, so when it went up you'd automatically pay them more.

Amazon actually does allow SQL Server. The poster saying it probably isn't used was likely influenced by the fact that Azure, Microsoft's own cloud solution, is an AWS competitor.

To your own points:

* SQL Server is as proprietary as Oracle

* SQL Server is cheaper than Oracle

* SQL Server is being ported to Linux in 2017 :)

Amazon.com seems to strongly encourage the companies it acquires to use AWS. Apparently, woot.com is fixing to rewrite the website in Java or something just so they do not have to use Windows. Somehow, it makes sense to them.

How long does it take you to load 2 TB into SQLite? How long do queries take? I believe you, but I'm in disbelief that it could be close to as efficient as throwing into ram. I mean, an EMR cluster takes like 5 minutes to spin up.

Where do I learn how to do this? I've tried loading a TiB (one table one index) into SQLite on disk before, and it took forever. Granted this was a couple years ago, but I must be doing something fundamentally wrong.

I want to try this out. I've got 6TiB here of uncompressed CSV, 32 GiB ram. Is this something I could start tonight and complete a few queries before bed?

Actually, out of curiosity, I looked it up on the sqlite site. If I'm reading the docs correctly, with atomic sync turned off, I should expect 50,000 inserts per second. So, with my data set of 50B rows, I should expect to have it all loaded in ... Just 13 days. What am I missing?

There was a little bit of unfair comparison. I didn't have to load the data over a busy network connection, and I didn't have to decrypt the data once it was loaded (I had the benefit of a firewall and the raw data on a USB3 hard drive). I think there was a conversion from CSV to parquet on the cluster as well. And the engineer who set up the cluster was multitasking, so I'm sure there were some latency issues just from that. But my analysis still only took a few hours (5? Maybe 6?).

There are a handful of things that make a difference. First of all, don't use inserts, use the .import command. This alone is enough to saturate all the available write bandwidth on a 7200rpm drive. It is not transactional, so you don't have to worry about that...it bypasses the query engine entirely, really is more like a shell command that marshals data directly into the table's on disk representation. You can also disable journaling and increase page sizes for a tiny boost.

Once imported into SQLite you get the benefit of binary representation which (for my use case) really cut down on the dataset size for the read queries. I only had a single join and it was against a dimensional table that fit in memory, so indexes were small and took insignificant time to build. One single table scan with some aggregation, and that was it.
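The journaling and page-size knobs mentioned above map onto PRAGMAs. A hedged sketch (these settings sacrifice crash safety, so they only make sense for a scratch database you can rebuild from the source files):

```python
import os, sqlite3, tempfile

path = os.path.join(tempfile.mkdtemp(), "scratch.db")  # throwaway db file
conn = sqlite3.connect(path)

# page_size must be set before the database file is initialized.
conn.execute("PRAGMA page_size = 65536")
conn.execute("PRAGMA journal_mode = OFF")  # no rollback journal at all
conn.execute("PRAGMA synchronous = OFF")   # don't wait for fsync on commit

conn.execute("CREATE TABLE t (a INTEGER)")
with conn:
    conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])
```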

Good tips. I'll give it a try!

thanks for the SQLite tip, I've been meaning to add it to my tool set.

question: is SQLite incrementally helpful when I'm already comfortable with a local pgsql db to handle the use case you suggested? would SQLite be redundant for me in this case?

question: between postgres and unix tools (sed, awk) is there reason to use SQLite?

Reasons to prefer sed/awk: you're in bash

Reasons to prefer sqlite: it's easier to embed in an app, you want sort, join, split-apply-combine, scale, transactions, compression, etc.

Reasons to prefer pgsql: sqlite's perf tools suck compared to pgsql (last time I got stuck anyway) and I'm sure there are lots of sql-isms that sqlite doesn't handle if that's your jam. EDIT: forgot everything-is-a-string in sqlite, just wanted to add that it has bit me before.

Another reason to use Unix tools like sed and awk (and grep and sort and...) is that, if you just need to iterate over all your data and don't need a random-access index, they are really really fast.

don't forget bdb!

No. If you are already comfortable in PgSQL, there is no need to use SQLite.

You will see PostgreSQL come in handy once you get beyond the initial import stage.

PostgreSQL's type system will come to your aid. SQLite essentially treats everything as a string, which can turn nasty when you get serious with your queries.

Hey, would you be willing to go a little in to your background/qualifications as well as maybe what your daily/weekly workload and goals look like? BI is a field I'm curious about because it sounds very interesting, very important, and as far as I can tell, with tons of surplus demand even for entry level. It makes me wonder if the actual day to day must be horrible for those things to be true.

In all fairness, one can run Spark on a single machine. The key insight is that a "single machine" has gotten pretty big these days; the bit-shuffling technology may be secondary.

My favorite is when you jump on a project and do a simple estimation of compute throughput for the highly complex distributed system, and it's something like hundreds of kilobytes per second. You could literally copy all the files to one computer and process them faster than the web of broken parts. It becomes cancerous: to work around system slowness, ever more complex caching mechanisms and ad-hoc workarounds get constructed.

I think part of the problem is many engineers can debug an application, but surprisingly few learn performance optimization and finding bottlenecks. This leads to an ignorance is bliss mindset where engineers assume they are doing things in reasonably performant ways and so the next step must be to scale, without even a simple estimate for throughput. It turns into bad software architecture and code debt that will cause high maintenance costs.

Would you happen to know a good way to learn performance optimization? I'm working with datasets currently that I am trying to get to run faster, and I cannot tell if the limitation is on my hardware or due to ignorance on my part

Unfortunately I'm not familiar with a single resource for this, which probably contributes to developers being unfamiliar. There is a lot of domain specific knowledge depending upon if you're working with graphics, distributed computing, memory allocators, operating system kernels, network device drivers, etc.

The universal knowledge is learning to run well designed experiments, and this comes from practice. It's like how you would debug code without a debugger. There are profiling tools in some contexts that help you run these experiments, but at the highest level simply calculating the number of bytes that move through components divided by the amount of time it took is very enlightening.

It's valuable to have some rough familiarity of the limits of computer architecture. You can also do this experimentally; for example, you could test disk performance by timing how long it takes to copy a file much larger than RAM. You could try copying from /dev/zero to /dev/null to get a lower bound on RAM bandwidth. You can use netcat to see network throughput.
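The "bytes moved divided by time" experiment is easy to script. A rough sketch (the 64 MiB size is a toy value chosen so it runs quickly; a real disk benchmark needs a file much larger than RAM so the page cache can't absorb the writes):

```python
import os, tempfile, time

size = 64 * 1024 * 1024          # toy payload; use far more than RAM for real tests
chunk = os.urandom(1024 * 1024)  # 1 MiB of incompressible data

fd, path = tempfile.mkstemp()
start = time.perf_counter()
with os.fdopen(fd, "wb") as f:
    for _ in range(size // len(chunk)):
        f.write(chunk)
    f.flush()
    os.fsync(f.fileno())  # force the data to actually reach the device
elapsed = time.perf_counter() - start
os.remove(path)

print(f"wrote {size / 1e6:.0f} MB in {elapsed:.2f}s "
      f"-> {size / elapsed / 1e6:.0f} MB/s")
```

The same shape of script, with the write loop swapped for a read loop or a socket, gives you the RAM and network numbers mentioned above.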

Bandwidth is only part of the picture; in some cases latency is important (such as servicing many tiny requests). Rough latency numbers are in [0, 1], but can also be learned experimentally.

Many popular primitives actually don't perform that great per node. For example, a single MySQL or Spark node might not move more than ~10MB/s, significantly lower than network bandwidth. You can actually use S3 to move data faster if it has a sequential access pattern :)

[0] http://static.googleusercontent.com/media/research.google.co...

[1] https://gist.github.com/jboner/2841832

Like all things in tech, it depends. I agree with you that most startups will never see data that will not fit in memory. Remember http://yourdatafitsinram.com ?

Now, a data warehouse where you have a horde of Hadoop engineers slaving over 1000-node clusters is probably overkill unless you are a company like Facebook or are in the data processing biz. However, some database management systems can be very complementary.

For example, while you can do a majority of things with Postgres, some are not fun or easy to do.

Have you ever set up postgis? I'd much rather make a geo index in MongoDB, do the geo queries I need and call it a day than spend a day setting up postgis.

We find that MongoDB is great for providing the data our application needs at runtime. That being said, it's not as great when it comes time for us to analyze it. We want SQL and all the business intelligence adapters that come with it.

Yeah you could do a join with the aggregation framework in MongoDB 3.2 but it just isn't as fun.

> Have you ever set up postgis? I'd much rather make a geo index in MongoDB, do the geo queries I need and call it a day than spend a day setting up postgis.

I think you should try it again. I felt the same way when it was in the 1.5 version, but post 2.0 it is really easy. And the documentation has gotten so much better. It really is as simple as `sudo apt-get install postgis` and then `CREATE EXTENSION postgis;` in psql.

I think there's a more common reason why companies end up with "awful-to-work-with messes": ETL is deceptively simple.

Moving data from A to B and applying some transformations on the way through seems like a straightforward engineering task. However, creating a system that is fault-tolerant, handles data source changes, surfaces errors in a meaningful way, requires little maintenance, etc. is hard. Getting to a level of abstraction where data scientists can build on top of it in a way that doesn't require development skills is harder.

I don't think most data engineers are mediocre or find their job boring. The expectation from management that ETL doesn't require significant effort is unrealistic, and it leads to a technology gap between developers and scientists that tends to be filled with ad-hoc scripting and poor processes.

Disclosure: I'm the founder of Etleap[1], where we're creating tools to make ETL better for data teams.

[1] http://etleap.com/

I thought it nailed a lot of dynamics at my last gig as well. Of course, I joined for the challenges of scaling and wrote an ETL framework in Rails (that was a mixed bag but very instructive) and then got bored after I realized how small our data really was. Then I left for Google and all my data went up by two prefixes.

I do love reading an article that supports my contention that ETL is the Charlie Work [0] of software engineering.

0 - http://www.avclub.com/tvclub/its-always-sunny-philadelphia-c...

> I do love reading an article that supports my contention that ETL is the Charlie Work [0] of software engineering.

Though I've never formally done work with the "ETL" label, what I've seen of it reminds me of the work I used to do for a client years ago, where I'd take some CSV file (or what have you) and turn it into something that their FoxBase-based accounting system could use. It was boring grunt work (or shall we say, "Charlie work"), but I billed it at my usual rate and it paid the mortgage. I would never, ever wish to make a career of it, however. (And if my assessment of what someone knee-deep in ETL does all day is completely off base, I apologize.)

The thing is, I kinda identify with Charlie. I rather enjoy getting it done and it's the kind of thing that keeps the ship sailing smoothly. Sometimes you can derive a strange appreciation from thwacking rats in the basement, so to speak.

It's all Charlie Work for the VCs, "founders", and (occasional) early hires who take in the lion's share of profits and prestige for the long hours most of us pour into our jobs -- day after day, year after year.

BTW, that may sound polemic, but it's not -- that's really how a lot of business types, academic researchers, and others think of nearly all programming-related work.

Not that polemic indeed, but only if we're willing to stop gazing at our own navels. Most of us here aren't hired with nice paychecks because we're smart and creative, but because we're smart, creative, and yet without the sense to find better work than Charlie Work.

Anyway, this is a shame. I thought "data science" still had nice R&D flavored jobs. Colour me disillusioned.

> data sets sooooo huge, that they would fit on my laptop

Well, yes, this is an embarrassing fact for data scientists. A midrange stock MacBook can easily handle a database of everyone on earth. In RAM. While you play a game without any dropped frames.

This is very, very true, and it often kicks in at no more than a gigabyte of data or a hundred queries a second. Quite a few inexperienced devs suddenly think it's the big data they've been reading about, and FINALLY they can play with it!

With all the hype around "big data" and all that crap, many people seem to forget how far you can go with plain simple SQL when it's properly configured; I'm not talking about complicated optimizations, just solid fundamentals. And no problem if you can't do it yourself: things like Amazon RDS will help you.

People frequently think that their sub-petabyte dataset is a big data problem.

Time and again, the tools prove that it's not.

To my recollection, I can count the number of publicly-known companies dealing with these datasets on two hands, if being generous.

Indeed. Whenever I give a talk about "big data" I make sure to include the phrase, "I'm pretty sure most of the big data market exists on the hubris that developers want to desperately believe their problems are bigger than they really are."

It depends a lot on what you're doing, but 2 TB in general seems like a lot. If you have to perform an out-of-disk sort, you probably need a distributed setup. The funny thing is that most setups that use EMR defeat the data-locality principle, and I think that's why people see a speedup when they run the job on a single laptop, for example. Reading the original Google paper helped me a lot in understanding this.

If you have to perform an out-of-disk sort, buy another hard disk and now it's not out-of-disk anymore. This will suffice for nearly any data set you would ever need to sort.

A 4 TB drive costs about $120, and you'll spend way more than that on software development and extra computers if you do distributed computing when you don't need to.

Just to clarify, 2 TB every day, right?
