The data is reasonably structured so I think we can easily use a SQL database with possibly some XML or JSON columns. This would be much easier and quicker to set up.
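To make the "SQL with some JSON columns" idea concrete, here is a minimal sketch using Python's built-in sqlite3. The table and field names are invented for illustration; the point is just that the structured parts get real columns while the loosely-structured tail goes in a JSON text column.

```python
import json
import sqlite3

# Hypothetical schema: structured fields as columns, semi-structured
# attributes serialized as JSON in one column.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE records (
        id INTEGER PRIMARY KEY,
        patient_ref TEXT NOT NULL,
        recorded_at TEXT NOT NULL,
        extra_json TEXT           -- semi-structured attributes as JSON
    )
""")
conn.execute(
    "INSERT INTO records (patient_ref, recorded_at, extra_json) VALUES (?, ?, ?)",
    ("p-001", "2016-05-01", json.dumps({"device": "scanner-3", "dose_mgy": 1.2})),
)

# Structured parts are queried with plain SQL; the JSON tail is parsed
# only when actually needed.
row = conn.execute(
    "SELECT extra_json FROM records WHERE patient_ref = 'p-001'"
).fetchone()
extra = json.loads(row[0])
print(extra["device"])  # scanner-3
```

In Postgres you'd get the same shape with a `jsonb` column, which additionally lets you index and query inside the JSON.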
Is 1TB a size that makes sense for Hadoop? Are there any alternatives like Google BigQuery, MongoDB or others? Sorry, I am not up to date with the latest cloud offerings. Also, we are in the medical field so this raises some security questions.
Hadoop is a marvelous way to make simple things complicated. Look at what Google does internally (and has been doing internally for many years) with structured data: stick it in a fast structured store (BigQuery, or even MySQL before that, if you go back far enough with AdWords) on a redundant platform and be done with it.
Hadoop, Spark, etc are for unstructured stuff where you're not sure how to attack the problem. If you know how to store and query the data then you know how to attack the problem.
UPS had billions upon billions of rows in their package tracking system long before anyone came up with MapReduce at scale. If you've only got 1TB of data cropping up per year, it's reasonable to stick it in a database of some sort.
If you've got 5PB of new data a year, then you need to start thinking about moving the computation to the data; but even then, sometimes that just means writing stored procedures.
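"Moving the computation to the data" can be illustrated even at toy scale: let the database run the aggregate rather than shipping every row to the client. A minimal sketch with sqlite3 (schema invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE packages (id INTEGER PRIMARY KEY, weight_kg REAL)")
conn.executemany(
    "INSERT INTO packages (weight_kg) VALUES (?)",
    [(w,) for w in (1.5, 2.0, 0.5)],
)

# Don't do this at scale: every row crosses the wire to the client.
total_client_side = sum(
    w for (w,) in conn.execute("SELECT weight_kg FROM packages")
)

# Do this instead: one aggregate runs where the data lives.
(total_db_side,) = conn.execute(
    "SELECT SUM(weight_kg) FROM packages"
).fetchone()

print(total_db_side)  # 4.0
```

A stored procedure is the same idea taken further: the whole computation, not just one aggregate, executes inside the database.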
The Hadoop ecosystem makes exploration easy, since it supports any type of computation (graph, text mining, ML model building/validation). If all you're looking to do is random access over the data with a small number of known filters/joins at large scale, or you're under no time pressure to explore, then a sufficiently optimised DB will be the most efficient (and probably most cost-effective) way. Hadoop is a trade-off: general-purpose everything, but at the cost of infra complexity and computational inefficiency.
As for the original question - I think most people realize that the big in big data rarely refers to the actual size (volume) of data but rather that we still do not have very efficient techniques for doing certain kinds of data processing when the underlying data simply won't fit into the relational model. A good example is text - there is a reason why Google uses some kind of inverted index and not a relational database in which it stuffs all the web page text.
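The inverted-index point can be shown in a few lines: instead of storing page text in rows and scanning it, you map each term to the set of documents containing it. A toy sketch (documents invented for illustration):

```python
from collections import defaultdict

docs = {
    1: "hadoop makes simple things complicated",
    2: "postgres handles a terabyte just fine",
    3: "hadoop and spark suit unstructured data",
}

# Build the inverted index: term -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# Lookup is a set intersection, not a table scan.
hits = index["hadoop"] & index["unstructured"]
print(sorted(hits))  # [3]
```

Real search engines add term positions, ranking, and compression on top, but the core structure is this mapping, which doesn't fit the relational model comfortably.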
I don't really know if the medical field has any use for text mining, graph processing and the like. While recommending Hadoop just because the size is in excess of 1 TB looks knee-jerk, down-voting someone for a fairly non-opinionated comment suggesting they explore the use cases before deciding is even more knee-jerk.
It does. Although the graph processing use cases I've seen recently (and their developers) are better served by a faster query engine than the one Spark provides.
Most people, if they can get away with Excel, won't use an RDBMS. (Of course this means that eventually someone will have to come along and scrape all the damned spreadsheets into an RDBMS, but...)
Most people, if they can get away with an RDBMS, won't use Hadoop. (Of course, sometimes you end up with something like DB2 file pointers, which really just say "hey look here's a pile of unstructured data that we couldn't figure out how to handle, and this is where we left it", and then of course someone has to put it somewhere useful, but...)
Now if you're moving around copies of the Internet, or trillion-row "databases" that need nearly instantaneous OLAP, then yeah, you'll be needing a proper distributed infrastructure. However, sometimes you can just rent that proper infrastructure from a vendor with that problem (e.g. Google or Amazon) and then you don't have to support it.
Things get really interesting (as in bleeding edge research interesting) when none of the above solve your problem. But they also tend to push the time horizon for results way out.
JMHO. Eventually software eats everything. It's a question of time scales. If you need results next week, don't rebuild TensorFlow or Redshift from scratch.
I keep hearing about Hadoop and Spark for graphs and models, but in actual practice (WHICH DEPENDS ON THE SIZE AND SHAPE OF THE DATA), an awful lot of data-feeding problems are more easily solved by columnar data stores, or graph data stores. For text mining and NLP, you are almost certainly better off with redundant unstructured storage due to the fundamentally unstructured nature of text and communications. For massively parallel model exploration sometimes it's better just to spin up a bunch of huge EC2 instances. Hadoop and/or Spark aren't necessarily critical aspects of solving the problem. Once you get a redundant, distributed infrastructure to run things on, you may not need to jerk around with name nodes and boring HA grunt work. Sometimes an existing purpose-built implementation (often on top of well-oiled infrastructures) does the trick. Sometimes not.
One interesting project I've kept an eye on (OK I lied, I'm a contributor) is storage of genomic data against a distributed, graph-structured reference. It's really fucking hard. Even with Google, UCSC, the Broad, and the usual heavyweights involved, the milestones have occurred on a timescale of years, and the eventual adoption is projected on a scale of decades. This is with some of the best in the world working on it, in every time zone. Again: DECADES.
So... I'm not against distributed computation. But you had better damned well know what you're getting into. I worked at Google. I work on GA4GH. If you can afford to take the long view, maybe your problems are complicated enough to invest in that sort of infrastructure.
But maybe they aren't, and Amazon or Google or Microsoft has already built what you need because they needed it too, and they hired a hundred graduates from the best engineering schools in the world to support their implementation. Is it worth reinventing the wheel when there are people racing at Formula 100 level out there? Only you can answer that.
There's a reason people don't swat flies with Buicks. Sometimes all you really need is a flyswatter.
Without going into too much detail, I am working on a project that handles automated reports/metrics, and our only problem with Postgres has been our write-heavy workload (554M transactions a day, ~250GB a week). These are still very early days, and a fraction of our "production" target scale, but we haven't had any read issues.
Our problem is that our constant writes to tables mean that our checkpoints [http://dba.stackexchange.com/questions/61822/what-happens-in...] started to take significant periods of time and happen more and more frequently. Postgres also has some write amplification [http://blog.heapanalytics.com/speeding-up-postgresql-queries...] and VACUUMING challenges [https://www.postgresql.org/docs/current/static/routine-vacuu...].
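For anyone hitting the same checkpoint pressure, the usual first knobs are in postgresql.conf. A hedged starting point; the values below are illustrative, not a recommendation for any particular workload, and the setting names assume a reasonably recent Postgres (older versions use checkpoint_segments instead of max_wal_size):

```
# Spread checkpoints out and let the WAL grow before forcing one.
checkpoint_timeout = 15min            # default is 5min
max_wal_size = 8GB                    # default is 1GB
checkpoint_completion_target = 0.9    # smooth checkpoint I/O over the interval
```

Fewer, more spread-out checkpoints also reduce full-page-write amplification, since fewer pages get their first post-checkpoint write per unit time.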
But again these issues are specifically due to our write-heavy, timeseries data. For now we mitigate the effects with sharding and partitioning as we transition to Cassandra...but it sounds like you don't have a similarly write-heavy workload. So I think you might be able to get away with just Postgres.
> But my data is 100GB/500GB/1TB!
> A 2 terabyte hard drive costs $94.99, 4 terabytes is $169.99. Buy one and stick it in a desktop computer or server. Then install Postgres on it.
Obviously, you should look at the role I/O costs can play. A lot of RAM and some SSD drives might be a better idea than Hadoop at this scale.
Finally, for this amount of data, you could easily build SSD based servers, load it up with lots of RAM and slap any open source DB on them too. It'll most likely be sufficient for your purposes.
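A quick back-of-envelope calculation makes the single-machine case concrete: how long does a full sequential scan of 1 TB take? The throughput figures below are rough, illustrative assumptions, not benchmarks.

```python
data_tb = 1.0
bytes_total = data_tb * 1e12

# Approximate sequential-read throughput per device class (assumed).
for device, gb_per_s in [("spinning disk", 0.15),
                         ("SATA SSD", 0.5),
                         ("NVMe SSD", 2.0)]:
    seconds = bytes_total / (gb_per_s * 1e9)
    print(f"{device}: full scan in ~{seconds / 60:.0f} min")
```

Even the worst case is under two hours for a brute-force full scan, and with indexes or enough RAM for the hot set you rarely scan everything; that's the sense in which 1TB is comfortably single-machine territory.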
I'd strongly recommend against building an in-house system. You're going to spend hundreds of thousands of dollars in developer salary and hardware, and won't get that much out of it.
I'm sure AWS and Google both offer HIPAA compliance, though you might have to pay more.
Solr or Elasticsearch may be other options that can work with few instances and moderate horsepower.
I assume you don't have someone/some system poring over the data 24/7, so you could look into hosted notebooks like Databricks or Qubole (or, gleaning from people I met at a recent conference, any of the bazillion that are about to launch), or host Zeppelin yourself, or use AWS EMR.
And don't lose your source data for that moment when you do figure out what you want to do with the data and you realise you need to reformat it.
We couldn't justify Hadoop until we were past 6TB... that was some years ago, pre-SSD and pre big disks. Our old RDBMS server had become horribly complicated because it hit that scale; when it got to 10TB, its RAID controller corrupted the control files for the db. We had support, and the consultants who came did a bunch of mind-blowing work to restore it.
But by then we had everything on Hadoop and we were only maintaining one legacy app on the old box - which we had told the users was dead and buried in any case (but kept to maintain friendships and so on).
Back to my main point: some people think that analytics starts with a question. I don't; I think it starts with the data and with instruments to inspect and understand the data. Get the data onto some sensible box, put a good database on there that your guys can use (as they are SQL folks, get a SQL one), and put R, RStudio and Shiny on it and start inspecting and understanding it.
If no questions or insights arise, then little is lost and it can go on a cheap fileserver until someone needs or understands it. If nuggets of gold start appearing, then you can invest further in both skills and kit. I would recommend Hadoop either if other datasets at the 10GB+ scale start appearing or if this one grows beyond 4TB. The other reason to recommend it would be mega joins, which Hadoop does well, and the general need to build a cheap EDW; if you have lots of cash you can build an expensive EDW instead! Why 4TB? Because disks are currently ~6TB, and in my experience there are nasty scaling bumps in the non-Hadoop world that start kicking in around there.
IT are right to charge once this is established to be valuable; if it's worth £250k a year to your business, then it makes sense to spend £50k a year putting SLAs and resilience on it. If it's worth £2k... well...