Practical advice for analysis of large, complex data sets (unofficialgoogledatascience.com)
232 points by yarapavan on Nov 9, 2016 | 32 comments

I hope some experienced data analytics people will read this thread so here is a slightly unrelated question: We have a data set of 1 TB growing at 1 TB/year we need to analyze. Our IT is pushing for Hadoop but this involves a lot of integration work because they have no plumbing ready. The whole thing just feels way too complex for our use case.

The data is reasonably structured so I think we can easily use a SQL database with possibly some XML or JSON columns. This would be much easier and quicker to set up.

Is 1TB a size that makes sense for Hadoop? Are there any alternatives like Google BigQuery, MongoDB or others? Sorry, I am not up to date with the latest cloud offerings. Also, we are in the medical field so this raises some security questions.

If it's structured you probably don't need Hadoop. Stuff it in BigQuery or Redshift and be done with it.

Hadoop is a marvelous way to make simple things complicated. Look at what Google does internally (and has been doing internally for many years) with structured data: stick it in a fast structured place (BigQuery, or even MySQL before that, if you go back far enough with AdWords) on a redundant platform and be done with it.

Hadoop, Spark, etc are for unstructured stuff where you're not sure how to attack the problem. If you know how to store and query the data then you know how to attack the problem.

UPS had billions upon billions of rows in their package tracking system long before anyone came up with MapReduce at scale. If you've only got 1TB of data cropping up per year, it's reasonable to stick it in a database of some sort.

If you've got 5PB of new data a year, then you need to start thinking about moving the computation to the data; but even then, sometimes that just means writing stored procedures.

Delighted Redshift customer chiming in that this is the right answer. BigQuery or Redshift will meet your analysis needs and they both support HIPAA compliance. 1TB is not a lot of data for these platforms so you're not gonna spend time building out a ton of infrastructure.

This, SOOO much this! Please don't enter the morass that is Hadoop / Hive / Presto / Spark [esp. Spark] unless you really, really need to. Redshift sounds really good for your needs.

Hadoop etc is not just about raw scale. Scale is relative to what you are hoping to do with the data - so make the decision based on what you want to do with the data.

The Hadoop ecosystem makes exploration easy since it'll support any type of computation (graph, text mining, ML model building/validation). If all you're looking to do is random access of the data with a small number of known filters/joins at large scale, or you're under no time constraint to explore, then a sufficiently optimised DB will be the most efficient (and probably the most cost-effective) way. Hadoop is a trade-off: general purpose everything, but at the cost of infra complexity and computational inefficiency.

I actually cannot believe your comment is being down voted. The mods need to really see what is going on here.

As for the original question - I think most people realize that the big in big data rarely refers to the actual size (volume) of data but rather that we still do not have very efficient techniques for doing certain kinds of data processing when the underlying data simply won't fit into the relational model. A good example is text - there is a reason why Google uses some kind of inverted index and not a relational database in which it stuffs all the web page text.

I don't really know if the medical field has any use for text mining, graph processing and the like. While recommending Hadoop just because the size is in excess of 1 TB looks knee jerk, someone being down-voted for a fairly non-opinionated comment where they suggest exploring use cases before deciding is even more knee jerk.

fyi: re: "I don't really know if the medical field has any use for text mining, graph processing and the like."

It does. Although the graph processing use cases that I've seen recently (and their developers) are better served by a faster query engine than e.g. Spark has provided.

Most people, if they can get away with Excel, won't use an RDBMS. (Of course this means that eventually someone will have to come along and scrape all the damned spreadsheets into an RDBMS, but...)

Most people, if they can get away with an RDBMS, won't use Hadoop. (Of course, sometimes you end up with something like DB2 file pointers, which really just say "hey look here's a pile of unstructured data that we couldn't figure out how to handle, and this is where we left it", and then of course someone has to put it somewhere useful, but...)

Now if you're moving around copies of the Internet, or trillion-row "databases" that need nearly instantaneous OLAP, then yeah, you'll be needing a proper distributed infrastructure. However, sometimes you can just rent that proper infrastructure from a vendor with that problem (e.g. Google or Amazon) and then you don't have to support it.

Things get really interesting (as in bleeding edge research interesting) when none of the above solve your problem. But they also tend to push the time horizon for results way out.

JMHO. Eventually software eats everything. It's a question of time scales. If you need results next week, don't rebuild TensorFlow or Redshift from scratch.

Disclaimer: not only did I upvote you but I totally agree with you. HOWEVER, it's not clear to me whether people are comprehending the tradeoff you describe. That is CRITICAL.

I keep hearing about Hadoop and Spark for graphs and models, but in actual practice (WHICH DEPENDS ON THE SIZE AND SHAPE OF THE DATA), an awful lot of data-feeding problems are more easily solved by columnar data stores, or graph data stores. For text mining and NLP, you are almost certainly better off with redundant unstructured storage due to the fundamentally unstructured nature of text and communications. For massively parallel model exploration sometimes it's better just to spin up a bunch of huge EC2 instances. Hadoop and/or Spark aren't necessarily critical aspects of solving the problem. Once you get a redundant, distributed infrastructure to run things on, you may not need to jerk around with name nodes and boring HA grunt work. Sometimes an existing purpose-built implementation (often on top of well-oiled infrastructures) does the trick. Sometimes not.

One interesting project I've kept an eye on (OK I lied, I'm a contributor) is storage of genomic data against a distributed, graph-structured reference. It's really fucking hard. Even with Google, UCSC, the Broad, and the usual heavyweights involved, the milestones have occurred on a timescale of years, and the eventual adoption is projected on a scale of decades. This is with some of the best in the world working on it, in every time zone. Again: DECADES.

So... I'm not against distributed computation. But you had better damned well know what you're getting into. I worked at Google. I work on GA4GH. If you can afford to take the long view, maybe your problems are complicated enough to invest in that sort of infrastructure.

But maybe they aren't, and Amazon or Google or Microsoft has already built what you need because they needed it too, and they hired a hundred graduates from the best engineering schools in the world to support their implementation. Is it worth reinventing the wheel when there are people racing at Formula 100 level out there? Only you can answer that.

There's a reason people don't swat flies with Buicks. Sometimes all you really need is a flyswatter.

For 1 TB of data and 1 TB of data growth yoy, you might be able to get away with vanilla Postgres with reasonable sharding/partitioning/data structure.

Without going into too much detail, I am working on a project that handles automated reports/metrics, and our only problem with Postgres has been our write-heavy workload (554 M transactions a day with ~250 GB a week). This is still very early going and a fraction of our "production" target scale, but we haven't had any read issues.

Our problem is that our constant writes to tables mean that our checkpoints [http://dba.stackexchange.com/questions/61822/what-happens-in...] started to take significant periods of time and happen more and more frequently. Postgres also has some write amplification [http://blog.heapanalytics.com/speeding-up-postgresql-queries...] and VACUUMING challenges [https://www.postgresql.org/docs/current/static/routine-vacuu...].

But again these issues are specifically due to our write-heavy, timeseries data. For now we mitigate the effects with sharding and partitioning as we transition to Cassandra...but it sounds like you don't have a similarly write-heavy workload. So I think you might be able to get away with just Postgres.



I only know Hadoop from reading about it on HN, and the blog post I remember the most is "Don't use Hadoop - your data isn't that big"



> But my data is 100GB/500GB/1TB!

> A 2 terabyte hard drive costs $94.99, 4 terabytes is $169.99. Buy one and stick it in a desktop computer or server. Then install Postgres on it.

There's also this [0] on how to outperform Hadoop with command line tools.

Obviously, you should look at the role I/O costs can play. A lot of RAM and some SSD drives might be a better idea than Hadoop at this scale.

[0] http://aadrake.com/command-line-tools-can-be-235x-faster-tha...
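The point about single-machine tools comes down to streaming: read each record once, keep only a small running summary in memory, and never load the whole file. Here is a minimal sketch of that idea in Python rather than shell tools. The tab-separated layout and the `count_by_field` helper are assumptions made up for illustration, not something taken from the linked post.

```python
from collections import Counter
import io

def count_by_field(lines, field_index, sep="\t"):
    """Count occurrences of one column in a single streaming pass.

    Memory use grows with the number of distinct values,
    not with the number of input lines.
    """
    counts = Counter()
    for line in lines:
        parts = line.rstrip("\n").split(sep)
        if len(parts) > field_index:  # skip short or malformed lines
            counts[parts[field_index]] += 1
    return counts

# Hypothetical input; in practice this would be an open file handle.
sample = io.StringIO("a\tx\nb\ty\na\tz\n")
counts = count_by_field(sample, 0)
```

Because `lines` can be any iterable of strings, the same function works on an open file or on `sys.stdin`.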

It remains true. 4TB now is like $100 too, and SSDs that size are around even if they're an order of magnitude more expensive. (And then there's Seagate's 60 TB SSD prototype, clearly the future is with SSDs.)

1TB of data per year is small in today's terms. Seconding a lot of other comments here, the easiest option would be to just use Redshift (Amazon) or BigQuery (Google). You'll have to do some initial work to load this data up into one of those cloud databases but again, 1TB is not a lot. AWS and Google Cloud both have some special handling of data for HIPAA compliance too.



Finally, for this amount of data, you could easily build SSD based servers, load it up with lots of RAM and slap any open source DB on them too. It'll most likely be sufficient for your purposes.

In my experience, if you don't have security or privacy requirements, you're much better off putting your data in an easy-to-autoscale environment like BigQuery, Redshift etc. than trying to fiddle with on-prem or cloud Hadoop clusters. Hadoop requires a non-trivial amount of management and fine-tuning, which you can do if you have the scale to deploy a specialist platform team.

I've had good experience with Amazon Redshift for a ~40TB dataset. Google BigQuery can also work pretty well, depending on how exactly you want to use it. You can use SQL or SQL-like languages to query your data, and it works without much hassle.

I'd strongly recommend against building an in-house system. You're going to spend hundreds of thousands of dollars in developer salary and hardware, and won't get that much out of it.

I'm sure AWS and Google both offer HIPAA-compliance, though you might have to pay more.

A team I work with has a similar sized dataset (started off larger but growing slower) and we have it wrapped up in an Elasticsearch database. I can't speak to it being a better tool for your uses since I don't have any experience with Hadoop but I can say that it was very easy to get set up and continues to be easy to use, so if you're worried about overhead it's worth a look.

Postgres and a fair amount of RAM on a dedicated box, plus some scripts to pull the data in. You will also find you can probably shrink the data set: dates can be converted from strings to native representations, and if certain large strings are repeated you can spin them off into their own table, etc.
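A minimal sketch of the two shrinking tricks mentioned above, using Python's built-in `sqlite3` so it runs anywhere; the same schema idea carries over to Postgres. The table names, column names, and sample rows are all hypothetical, and treating the timestamps as UTC is an assumption.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical raw rows: a date string and a long site name repeated per row.
raw_rows = [
    ("2016-11-09 10:15:00", "Example General Hospital, Ward 7", 98.6),
    ("2016-11-09 10:20:00", "Example General Hospital, Ward 7", 99.1),
    ("2016-11-10 08:00:00", "Example Outpatient Clinic", 97.9),
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE site (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE reading (
        taken_at INTEGER NOT NULL,  -- epoch seconds instead of a 19-char string
        site_id  INTEGER NOT NULL REFERENCES site(id),
        value    REAL NOT NULL
    );
""")

def site_id(name):
    """Store each distinct name once and return its small integer key."""
    conn.execute("INSERT OR IGNORE INTO site (name) VALUES (?)", (name,))
    row = conn.execute("SELECT id FROM site WHERE name = ?", (name,)).fetchone()
    return row[0]

def to_epoch(text):
    """Parse a 'YYYY-MM-DD HH:MM:SS' string, assumed UTC, to epoch seconds."""
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())

for ts, site, value in raw_rows:
    conn.execute(
        "INSERT INTO reading (taken_at, site_id, value) VALUES (?, ?, ?)",
        (to_epoch(ts), site_id(site), value),
    )
conn.commit()

site_count = conn.execute("SELECT COUNT(*) FROM site").fetchone()[0]
reading_count = conn.execute("SELECT COUNT(*) FROM reading").fetchone()[0]
```

Each reading now carries two integers and a float; the long site name is stored once per site rather than once per row.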

Solr or elasticsearch may be other options that can work with few instances and moderate horsepower.

You could dump it into Google Bigtable, AWS Redshift, or just run a local Postgres/MySQL on a few SSDs and admin it directly.

I told my manager we could probably buy a 5TB disk and run the whole thing on his laptop for the next few years with Postgres :-)
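The back-of-envelope arithmetic behind a claim like that can be sketched as follows. It assumes the figures from the original question (1 TB today, growing 1 TB/year) and a 5 TB disk; the `overhead_factor` parameter for indexes and database bloat is a made-up knob for illustration, not a measured number.

```python
def years_until_full(disk_tb, current_tb, growth_tb_per_year, overhead_factor=1.0):
    """Years before the disk fills, given linear growth.

    overhead_factor scales raw data size to on-disk size
    (indexes, bloat); 1.0 means no overhead at all.
    """
    if growth_tb_per_year <= 0:
        raise ValueError("growth must be positive")
    free_tb = disk_tb - current_tb * overhead_factor
    return max(free_tb, 0.0) / (growth_tb_per_year * overhead_factor)

no_overhead = years_until_full(5, 1, 1)          # (5 - 1) / 1
with_overhead = years_until_full(5, 1, 1, 1.5)   # (5 - 1.5) / 1.5
```

With no overhead that gives four years; with a hypothetical 50% overhead it drops to a bit over two, so "the next few years" holds up only if indexes and bloat stay modest.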

You should figure out what you plan on doing with the data first. That will help inform what you use and how you store it.

Nobody really knows. It's all new. In my view we need something we can play with without having to write lengthy requirement docs to IT.

I've no idea of what your doc requirements are, but if you don't have any real plans for the data, stick it in a cost effective format that requires little to no maintenance but that can be read by multiple systems. Say parquet or avro for example.

I assume you don't have someone/some system poring over the data 24/7, so you could look into hosted notebooks like Databricks or Qubole (and, gleaning from people I met at a recent conference, any of the bazillion that are about to launch), or host Zeppelin yourself, or use AWS EMR.

And don't lose your source data for that moment when you do figure out what you want to do with the data and you realise you need to reformat it.

Remember you don't need all of the data for estimating models.
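One way to act on that is to draw a fixed-size uniform random sample in a single pass, without knowing the row count up front. Below is a minimal sketch using reservoir sampling (Algorithm R); the function name and the fixed seed are illustrative choices, not anything the commenter specified.

```python
import random

def reservoir_sample(rows, k, seed=0):
    """Return k rows chosen uniformly at random from an iterable of any length.

    Only k rows are held in memory at a time. If the iterable has
    fewer than k rows, all of them are returned.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return sample

subset = reservoir_sample(range(1_000_000), 1000)
```

The same function works on a database cursor or an open file, since it only needs to iterate once.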

Do you expect to solve any known business problem with it, or are you just toying around in hopes that something comes out of it?

Right now there is no data at all so a lot of people hope they will find something useful. But nobody knows for sure.

What skills do you have in the team? Also, budget & timescales.

SQL and general software development are the main skills. We are not data experts. The timeline should be as short as possible. I feel IT is making this into a big project with a big budget to justify their existence and charge a lot of money to our cost center, rather than acting in our interest. But I don't have hard data because I don't know that world well.

1TB sounds big, but if you were to get a server with a couple of SSDs and put a supported RDBMS on it you could do a lot. I think that you have a cast iron case with IT because you don't know what the outcomes are for the business yet - frankly who's going to sign off for a big implementation until you do?

We couldn't justify Hadoop until we were at 6TB+. That was some years ago, pre-SSD and pre big disks. Our old RDBMS server had got horribly complicated because it hit that scale, and when it got to 10TB its RAID controller corrupted the control files for the DB. We had support, and the consultants who came did a bunch of mind-blowing work to restore it.

But by then we had everything on Hadoop and we were only maintaining one legacy app on the old box - which we had told the users was dead and buried in any case (but kept to maintain friendships and so on).

Back to my main point - some people think that analytics starts with a question, I don't, I think it starts with the data and instruments to inspect and understand the data. I think that you should get the data on some sensible box, put a good database on there that your guys can use (as they are SQL folks get an SQL one) and put R, R-studio and R-shiny there and start inspecting and understanding it.

If no questions or insights arise then little is lost and it can go on a cheap fileserver until someone needs or understands it. If nuggets of gold start appearing then you can invest further in both skills and kit. I would recommend Hadoop either if other datasets of the 10GB+ scale start appearing or if this one gets >4TB. The other reasons to recommend it would be mega joins, which Hadoop does well, and the general need to build a cheap EDW (if you have lots of cash you can build an expensive EDW instead!). Why 4TB: disks are currently ~6TB, and in my experience there are nasty scaling bumps in the non-Hadoop world that start kicking in around that size.

IT are right to charge once this is established to be valuable; if it's worth £250k a year to your business then it makes sense to be spending £50k a year to put SLAs and resilience on it. If it's worth £2k... well...

I forgot, what kind of data is this? Is it images? rows in a table?

What is your daily data growth rate? What kinds of queries are you interested in? Do your regular queries need to access the whole dataset, or some (filtered) subset of it?

If the data isn't changing much, might I suggest a self-contained DB like H2 or SQLite?
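For a sense of how little setup the self-contained route needs, here is a minimal sketch with SQLite through Python's standard `sqlite3` module: the whole database is one file, with no server to install. The file name, table, and rows are hypothetical.

```python
import os
import sqlite3
import tempfile

# The entire database lives in this one file (hypothetical name and location).
db_path = os.path.join(tempfile.mkdtemp(), "analysis.db")

# Load step: create a table and insert a few made-up rows.
writer = sqlite3.connect(db_path)
writer.execute("CREATE TABLE measurement (patient_group TEXT, value REAL)")
writer.executemany(
    "INSERT INTO measurement VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 10.0)],
)
writer.commit()
writer.close()

# Analysis step: a separate connection (or process) opens the same file
# and queries it with plain SQL.
reader = sqlite3.connect(db_path)
averages = dict(
    reader.execute(
        "SELECT patient_group, AVG(value) FROM measurement GROUP BY patient_group"
    )
)
reader.close()
```

Copying `analysis.db` to another machine copies the whole data set, which is convenient for exploration but offers none of the access control a medical data set would likely need.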

I love this blog. They had a great post describing a privacy preserving query proxy: http://www.unofficialgoogledatascience.com/2015/12/replacing...
