
Don't use Hadoop when your data isn't that big (2013) - tosh
https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html
======
lettergram
I have seen so so so many projects get bogged down by the need to use a "big
data" stack.

I think my favorite example was a team that spent six months trying to build
a system to take in files, parse them, and store them. Files came through at a
little less than one per second at about 100 KB, which translated to roughly
2.5 GB of data a day. The data only needed to be stored for a year, and could
easily be compressed.

They felt the need to set up a cluster with 1 TB of RAM to handle processing
the documents, had 25 Kafka instances, etc. It was just insane.

I just pulled out a Python script, combined it with Postgres, and within
an afternoon I had completed the project (albeit not production ready). This
is so typical within companies it makes me gag. They were easily spending
$100k a month just on infrastructure; my solution cost ~$400 ($1200 with
replication).
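
Roughly speaking, the whole thing is a loop like this (a minimal sketch, not the actual script; the file layout, table name and connection details here are made up):

      import glob
      import psycopg2

      # Minimal sketch: read incoming files and store them in Postgres.
      # (Any real parsing of the file body would go where f.read() is.)
      conn = psycopg2.connect("dbname=docs")
      cur = conn.cursor()
      cur.execute("CREATE TABLE IF NOT EXISTS docs (name text, body text)")
      for path in glob.glob("incoming/*.txt"):
          with open(path) as f:
              cur.execute("INSERT INTO docs (name, body) VALUES (%s, %s)",
                          (path, f.read()))
      conn.commit()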

The sad part is that convincing management to use my solution was the hardest
part. Basically, I had to explain how my system was more robust, faster,
cheaper, etc. Even side-by-side comparisons didn't seem to convince them; they
just felt the other solution was better somehow... Eventually, I convinced
them, after about a month of debates and an endless stream of proof.

~~~
quantumhobbit
The problem you had was that you were trying to convince management to act in
the best interest of the company rather than their own best interest. They
would much rather be in charge of a $100k a month project instead of a $1k a
month project. Also the big solution required a bunch of engineers working
full time which puts them higher up the ladder compared to your 1 engineer
working part time solution.

You were saving the company money but hurting their resumes.

~~~
osrec
Reminds me of almost every technology project I've seen in finance! It's all
about building complex empires, rather than simple, functional solutions.
Sometimes, it's also just about using up the assigned technology budget so
that it's not scaled down the following year!

~~~
rhizome
Funny, I had a phone screen today and in the midst of asking about how I saw
my future and what I wanted to work in, the recruiter was tiptoeing around a
situation where someone at the company had suggested React or something and
the devs had pushed back (successfully) on the basis that the site _didn't
need it_. I got the feeling he was trying not to step on my dreams of being a
cutting-edge JS frontend trailblazer, but it is really a point in the
company's favor (to me) that they were able to resist the urge.

Basically, they are looking for web developers but it seems like they have to
filter out all the frontend ninja rockstars.

~~~
Rapzid
Why didn't they "need" it? How do they decide what they "need"?

This sounds like typical anti-change pushback, which I have learned can
actually be a good thing. However, this anecdote is severely lacking in
insight; much like most people's support of, or opposition to, change.
Further, like the widespread belief that sentences shouldn't start with
conjunctions; much less conjunctive adverbs.

~~~
paulddraper
Um, maybe they didn't need client side rendering? Plain ol' server-side HTML
templating was faster and simpler?

Not every web project needs react, or even JS.

~~~
eropple
At this point, I'm not sure I agree. I am...not a fan of JavaScript, to put it
mildly (though ES6 does a lot to suck less). But for my money, nothing out
there is better at templating even static HTML than React and JSX. The
compositional nature of components is _way_ better than any other templating
stack I've ever used and it's pretty easy to bake it into static HTML (Gatsby
and `react-html-email` being awesome projects for it).

I'm sure there are declarative, object-oriented (rather than text-oriented)
templating engines out there that use an approach like React's. But I would
consider using an imperative, text-oriented templating language a yellow, if
not red, flag in 2017.

~~~
paulddraper
I use Twirl, included in the Play Framework (Scala/Java).

It is functional (well, as functional as React), and templates compile to
plain ol' functions, so compatibility and static typing are the same as for
the rest of your program.

Obviously, if I needed a SPA or something, it's not what I would use, but
again, not everything should be an SPA.

~~~
eropple
You don't need to use React as a SPA, though, is what I'm saying. (When using
React, those component trees, too, compile to 100% plain-ol'-functions.)

Twirl is fine, insofar as it's attached to Play (that's not to impugn you
for picking it; my history with Play is colorful and frustrating). I wouldn't
raise a flag for that. But _not_ using something in this vein definitely is one,
and React is probably the most accessible way to do it for the 90-95% case of
developers.

------
nostrademons
The other point that's commonly missed is that Hadoop is really only useful
when both inputs _and_ outputs are too large to fit on one machine. If you
have a big-data input but a small-data output (which is very common in a lot
of exploratory problems), you can get away with a simple work-queue setup that
sends results into a shared RDBMS or filesystem.

At the beginning of my current project, I had a job that involved 35T of
input, but the vast majority of records would be ignored, and then for each
successful one, only a few hundred bytes of output would be generated. Rather
than Hadoop, I set up a simple system where a number of worker processes would
query Postgres for the next available shard, mark it as in-progress, and then
stream it from S3 and process it. When they finished, they'd write a CSV file
back to S3. The reduce phase was just 'cat'.
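
The claim-a-shard loop was about this simple (a rough sketch; the shards table, its columns, and the S3 helpers here are hypothetical stand-ins):

      import psycopg2

      def process_shard(s3_key):
          # placeholder: stream the shard from S3 and run the actual algorithm
          ...

      def upload_result(shard_id, csv_body):
          # placeholder: write the small CSV result back to S3
          ...

      conn = psycopg2.connect("dbname=jobs")
      while True:
          with conn, conn.cursor() as cur:
              # FOR UPDATE SKIP LOCKED (Postgres 9.5+) keeps concurrent workers
              # from claiming the same shard.
              cur.execute(
                  "UPDATE shards SET state = 'in_progress' "
                  "WHERE id = (SELECT id FROM shards WHERE state = 'pending' "
                  "            LIMIT 1 FOR UPDATE SKIP LOCKED) "
                  "RETURNING id, s3_key")
              row = cur.fetchone()
          if row is None:
              break                              # nothing left to claim
          shard_id, s3_key = row
          csv_body = process_shard(s3_key)
          upload_result(shard_id, csv_body)
          with conn, conn.cursor() as cur:
              cur.execute("UPDATE shards SET state = 'done' WHERE id = %s",
                          (shard_id,))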

The resulting system took a few hours to build (a few days, including the
actual algorithm runs), and it was _much_ more debuggable than Hadoop would be. You
could inspect exactly where the job was, what shards had errored out, and
which were currently running on machines, and download & view intermediate
results before the whole computation finished. You could run the workers
locally on a MBP if you needed to debug a shard, with no setup needed.

When I was at Google, we had a saying that "The only interesting part of
MapReduce is the phase that's not in the name: the Shuffle". [That's the phase
where the outputs of the Map are sorted, written to the filesystem and
eventually network, and delivered to the appropriate Reduce shard.] If you
don't need a shuffle phase - either because you have no reducer, your reduce
input is small enough to fit on one machine, or your reduce input comes
infrequently enough that a single microservice can keep up with all the map
tasks - then you don't need a MapReduce-like framework.

~~~
lmm
> it was much more debuggable than Hadoop would be. You could inspect exactly
> where the job was, what shards had errored out, and which were currently
> running on machines, and download & view intermediate results before the
> whole computation finished. You could run the workers locally on a MBP if
> you needed to debug a shard, with no setup needed.

To the extent that's true it's an indictment of Hadoop's implementation. Doing
all those things in Hadoop ought to be trivial; maybe there are a few tools
you'd have to make a one-time effort to learn, but reusing them ought to save
you effort over making a custom system every time.

~~~
acdha
That was certainly the case in the past. We backed away from it ~5 years ago
after wasting too much time investigating cases where a daemon had leaked
memory, hit the heap limit, and failed in a way which caused work to halt but
produced no visible error state other than a stack trace in one of the many
log files. Local testing wasn't viable since you needed something like 16-20GB
to run stably, and back then we didn't have individual dev systems with enough
RAM to run that kind of overhead and still have enough left for the actual
work.

------
justinsaccount
One problem I've noticed is that there are no good "medium data" tools.

Column stores are crazy fast, but there isn't much simple tooling built around
things like Parquet or ORC files. It's all gigantic Java projects. Having some
tools like grep, cut, sort, uniq, jq, etc. that worked against Parquet files
would go a long way toward bridging the gap.

Something like pyspark may be the answer; I think it may be possible to wrap
it and build the tools that I want, like:

    
    
      find logs/ | xargs -P 16 json2parquet --out parquet_logs/
      parquet-sql-query parquet_logs/ 'select src,count(*) from conn group by src...'
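
For what it's worth, a thin pyspark wrapper gets pretty close to that second command; a rough sketch (the paths and query string are just placeholders):

      import sys
      from pyspark.sql import SparkSession

      # Rough sketch of a "parquet-sql-query" style tool on top of pyspark.
      # Hypothetical usage: parquet_query.py parquet_logs/ "select src, count(*) from t group by src"
      spark = SparkSession.builder.master("local[*]").appName("parquet-query").getOrCreate()
      spark.read.parquet(sys.argv[1]).createOrReplaceTempView("t")
      spark.sql(sys.argv[2]).show(100, truncate=False)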
    
    
    

I've been testing [https://clickhouse.yandex/](https://clickhouse.yandex/). I
threw it on a single VM with 4 GB of RAM and imported billions of flow records
into it. Queries rip through data at tens of millions of records a second.

Edit: another example... I have a few months of ssh honeypot logs in a
compressed json log file. Reporting on top user/password combos by unique
source address took tens of minutes with a jq pipeline. The same thing
imported into clickhouse took a few seconds to run something like

    
    
      select user,password,uniq(src) as sources from ssh group by user,password order by sources desc limit 100

~~~
kmax12
For "medium data", my company has found a lot of success using dask [0], which
mimics the pandas API but can scale across multiple cores or machines.

The community around dask is quite active and there's solid documentation to
help learn the library. I cannot recommend dask enough for medium data
projects for people who want to use python.

They have a great rundown of dask vs. pyspark [1] to help you understand why
you'd use it.

[0] [http://dask.pydata.org/en/latest/](http://dask.pydata.org/en/latest/)

[1]
[http://dask.pydata.org/en/latest/spark.html](http://dask.pydata.org/en/latest/spark.html)
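
To give a flavor of how small the pandas-to-dask jump is, a tiny sketch (the file pattern and column names are made up):

      import dask.dataframe as dd

      # Same shape as the pandas code, but partitioned and run across cores;
      # .compute() triggers the actual (parallel) work.
      df = dd.read_csv("events-2017-*.csv")
      top = df.groupby("user_id")["amount"].sum().nlargest(10)
      print(top.compute())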

~~~
wakkaflokka
I've been trying to change all of my Luigi pipeline tasks from using Pandas to
Dask, so that I can push a lot more data through. Seems like an easy process
so far, and I like the easy implementation of parallel computing.

------
kareemm
There's a meta-pattern here. I've seen countless (especially younger) dev
teams focus on new and hot tech to the detriment of solving the business
problem. I've seen this happen in two, sometimes overlapping, cases:

1\. When the devs' agenda is about learning new tech rather than solving
business problems. The ways to solve this are to incentivize devs at the
business problem level (hard) or find devs who care more about solving
business problems instead of learning hot new tech (easier).

2\. When the product management function is weak within an org. Product
defines the requirements, and makes trade-offs around the solution. A strong
PM will recognize when a bazooka is being used to kill a fly, and will push
dev to make smarter trade-offs that result in a cheaper, faster, more
maintainable solution. This is especially challenging when the dev team cares
more about shiny tech than solving business problems.

~~~
brazzy
It's not necessarily devs pushing for hyped technologies that don't fit the
business problem.

Half- or nontechnical managers often follow tech hypes as well and may push
for the project to use "Big Data technology" simply because it makes them feel
more important to lead a project that is part of the hyped topic.

~~~
kareemm
Agreed! This falls into the "weak product management" category imho.

FWIW I think most devs could learn how to push back on poor decisions like
this made by non-technical managers as it really shouldn't be the manager's
responsibility to dictate technology choices for a solution. It _should_ be
their responsibility to push the dev team to make tradeoffs in order to
achieve a business result.

Devs who are good at understanding the business objectives and pushing the
non-technical team to make better decisions are both wonderful to work with
and command higher salaries / fees.

------
minimaxir
In 2017, with Spark's Catalyst engine and DataFrames data structure (allowing
SQL-esque operations instead of requiring code written in map-reduce
paradigms), you can have the best of both worlds in terms of big data
performance and high usability. Running Spark in a non-distributed manner may
sound counterintuitive, but it works well and makes good use of all available
CPU, RAM, and disk.

Spark is orders of magnitude faster than Hadoop, too.
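
For example, local mode is just a different master setting; a quick sketch (the input path and column names are invented for illustration):

      from pyspark.sql import SparkSession, functions as F

      # Non-distributed Spark: master("local[*]") uses every core on the one box.
      spark = SparkSession.builder.master("local[*]").appName("local-analytics").getOrCreate()
      df = spark.read.json("events.json.gz")
      (df.filter(F.col("status") == 200)
         .groupBy("url")
         .count()
         .orderBy(F.desc("count"))
         .show(20))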

~~~
matt_wulfeck
Spark is great, but honestly who writes pure MR jobs anymore anyway? At least
use hive so that your engineers can stick with sql syntax.

There's also Facebook's presto, which is so much faster than hive it will make
your head spin!

~~~
_dark_matter_
Spark is not pure MR. Spark has Spark Dataframes and Datasets with SQL-like
syntax. You can even write pure SQL and not mess with the Dataframe API at
all.

Presto is nice, but you can't use it for an ETL job. It is great for analysis.

~~~
electrum
Presto works great for ETL. It supports CREATE TABLE, INSERT and DELETE for
Hive data. Many companies use it for ETL. See examples here:

[https://prestodb.io/docs/current/connector/hive.html#example...](https://prestodb.io/docs/current/connector/hive.html#examples)

------
Fenrisulfr
I really like the article, and it agrees with what we've been doing at
my company. Datasets top out at 2 GB, and clients ask for "big data" solutions
that don't make sense. My only complaint is that I don't feel comfortable
sharing it with my co-workers because of the vulgar image on there... :/

~~~
jastr
For datasets maxing out at 2 GB you can try www.CSVExplorer.com. It might be a
good way to give your clients access to their data.

Disclaimer: I built CSV Explorer.

------
aub3bhat
Honestly, this is an outdated article. In 2017 Hadoop is more like an
OS/platform/ecosystem, with a filesystem (HDFS), a scheduler, applications,
etc. Spark, Presto, and Hive now ensure that you no longer have to write
Map-Reduce jobs. I understand the point about 600 MB fitting in memory and the
speed offered by command-line tools, but it's better to just use Spark (which
has a local mode with convenient in-memory caching), so that when you have to
create a company-wide data handling/processing system you can just plug your
code into a Spark/Presto/Hadoop cluster.

Finally, if you are truly looking for speed while maintaining portability,
these days I would recommend using Docker containers with external volumes
created on tmpfs, providing the speed of an in-memory implementation while
remaining agnostic to both OS and FS.

~~~
makapuf
Honestly, you also have to factor in the price of distributed computing, and
take into account the fact that processing power and disk sizes have increased
a lot in that timeframe as well.

Then you don't need a new OS/platform/ecosystem with a filesystem (HDFS),
scheduler, and applications to maintain, host and secure, in order to compute
things that could be done with local scripts or even Java / Cython programs,
or to distribute across a few servers what can (often) be distributed easily.

That still covers most of the use cases.

~~~
threeseed
Not sure that you understand what you're saying.

You don't need a new OS/Platform/Ecosystem to run Hadoop/Spark. It is
literally just half a dozen Java apps that you run on typical hardware on your
regular Linux OS. That is specifically what it was designed for. So rather
than write your own Java apps and distribute them yourself, why not use a
platform that is proven and performant, and that with Spark will let you do
things you could never write yourself?

~~~
makapuf
The OS/Platform/Ecosystem was a reference to the post I was replying to.

To be clearer, I don't think you need such things when you just need one or
maybe a few servers talking together, once you remove the cost of parallelism
(see the article about COST) and all the Hadoop machinery. I have an OS,
platform and ecosystem; it's called Unix and it works for 95% of my needs.

Frankly, I've seen my current project implemented partly on a single node (to
be replicated for later scalability) and partly as multiprocess Python. The
Python version is maybe 10x speedier and has 10% of the code and run
complexity. We tried to run it on a shared cluster of a hundred nodes for
giggles; the Python finished running before the other one had even started
launching. And that's not taking into account the setup / maintenance.

The COST model is a good one indeed.

So I'm roughly paraphrasing the article (which I point anyone talking about
needing big data to, just before asking how big their data exactly is.)

------
johan_larson
Rule of thumb: if the data fits on an ordinary PC, it's not Big Data. The
right tool is probably an Ordinary Database.

~~~
threeseed
Rule of thumb: Don't make stupid rules of thumb.

Your point is simply not true. Many users of big data platforms, e.g. Spark,
aren't using them because of the volume of data. It's because they want to do
machine/deep learning on a proven and popular platform, with technologies like
Caffe, Sparkling Water, and Tensorflow all available on the one platform.

~~~
bnegreve
Is Tensorflow available on Spark? That sounds strange.

------
atemerev
"Don't worry too much if you only have small data. The others' isn't as big as
they say, either".

------
elvinyung
There's a really good paper that somewhat formalizes this:
[https://www.usenix.org/system/files/conference/hotos15/hotos...](https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-
mcsherry.pdf)

~~~
emmelaich
Excellent! Thanks.

------
AJRF
Imagine my glee when reading this thread after a long day of trying to get a
simple answer as to where to deploy a JAR / where data should be landed.

Seven hours, four email chains and around 12 people later, still no answer,
and they've all gone home...

I've been working on a project for an insurance conglomerate for the past 6
months (the total project has been running for close to a year) that has very
recently hit the rocks.

The client has gotten fed up with our inability to deliver, so they canned the
long-term plans for the project in favour of a stopgap solution in order to
fulfil some short-term business goal.

We've acquired a relatively ludicrous amount of hardware (I'm talking
$100,000+) for what is essentially the equivalent of SELECT COUNT(*) FROM ALL
TABLES, which is then filtered into some slow, unusable BI tool that costs
$999+ per user per year.

The real tragedy is they already have geo-diverse SQL Server instances running
that would have allowed them to do this in one evening. One year and $1m+ in
billables, and we are somehow going to fail to deliver even that.

~~~
dheelus
Wonder which of the big consulting firms is running this one....

------
tosh
Related: [https://aadrake.com/command-line-tools-can-be-235x-faster-
th...](https://aadrake.com/command-line-tools-can-be-235x-faster-than-your-
hadoop-cluster.html)

------
curiousgal
But..but..all the cool kids are using it!

Seriously though, this also applies to a good chunk of ML "applications". Most
of the time, a 'simple' SVM would work perfectly.

~~~
douche
Most of the time, a simple decision tree can handle "advanced AI"
requirements.

~~~
gerad
Or linear regression

------
murbard2
I may be mistaken here, but I thought the point of Hadoop wasn't merely to
hold more data, but to do larger, distributed, computations on that data. You
might have only 10GB of data but need to perform heavy computations on them,
requiring a large cluster, with each worker needing to exchange data with
other workers periodically.

~~~
727374
Yes, this, and also parallelizing disk I/O. For example, you could fit a 5 TB
table on a single machine, but if you have an operation that requires a
full scan (e.g. a uniqueness count over arbitrary dates), that will take a very
long time on one disk. Yes, you could partition onto multiple disks, but Hadoop
offers a nice generalized solution.

------
thibaut_barrere
Unfortunately still relevant... I can't count the number of times investors
have pushed on a CEO, who in turn starts to worry about scalability
and asks the team whether Cassandra isn't a better fit, etc. (when the whole
data set can indeed be kept in memory or on some midsize USB key).

~~~
pkolaczk
Cassandra is not limited to use cases that need to scale out; it also
excels at stuff that needs to be always-on, regardless of data size. Sure,
you can do many nice things with data in memory or on a midsize USB key, but
when your centralized hardware fails, you have a problem, and in the best case
you restore from a backup - and during the downtime your customers go to your
competition.

------
jobu
Anyone have suggestions for tools to help with data analysis on moderate
numbers of very large records? We're working with close to a million records per
year, which isn't "Big Data", but the records themselves usually contain over
a thousand fields. (Is that "Wide Data"?)

Right now the records are stored in JSON documents, and we're working on ways
to process those documents for reporting and analysis. There were some
previous attempts to store the records "normalized" in SQL, but it quickly
became difficult to modify and maintain.

~~~
cle
Depends on not only the input, but also the output. What kind of analysis do
you need to perform?

~~~
jobu
We want to be able to generate things like pivot tables (charts, graphs, etc)
using location and/or time data from the records compared with many other
fields in the record. We're feeding a little of the data into Microsoft SSRS,
but setting it up is time-consuming and the results are slow and ugly.

I've done this in the past with OLAP cubes (SSAS) with much larger numbers of
small records. The difference here is there are so many more fields and we
have 20-30 new or modified fields each year. It doesn't seem like this should
be a unique problem, but I'm having a hard time finding tools that might make
it easier and less time-consuming. Maybe I'm just not asking the right
questions.

------
combatentropy
In this order, I recommend:

1\. Bash. Can you get the answer you need with awk, sed, sort, etc.? Linux
commands behave nicely with CPU and RAM, and "Command-line Tools can be 235x
Faster than your Hadoop Cluster" ([https://aadrake.com/command-line-tools-can-
be-235x-faster-th...](https://aadrake.com/command-line-tools-can-
be-235x-faster-than-your-hadoop-cluster.html)).

2\. SQLite. The maximum database size is 140 terabytes, and SQLite can join
different database files together. First I would recommend using bash to cull
your data down before import, if possible
([https://www.sqlite.org/limits.html](https://www.sqlite.org/limits.html)).

3\. PostgreSQL. Again, I would recommend using bash to cull your data down
before import, if possible
([https://www.postgresql.org/about/](https://www.postgresql.org/about/)).

    
    
      Maximum Database Size		Unlimited
      Maximum Table Size		32 TB
      Maximum Row Size		1.6 TB
      Maximum Field Size		1 GB
      Maximum Rows per Table	Unlimited
      Maximum Columns per Table	250 - 1600 depending on column types
      Maximum Indexes per Table	Unlimited
    

More: "Bane's Law (Going Deep)"
([https://news.ycombinator.com/item?id=8902739](https://news.ycombinator.com/item?id=8902739))
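
As a minimal illustration of option 2 above, the Python standard library alone will do; the file name, table, and query here are made up:

      import csv, sqlite3

      # Minimal sketch: load a (pre-culled) CSV into SQLite and query it.
      conn = sqlite3.connect("logs.db")
      conn.execute("CREATE TABLE IF NOT EXISTS hits (ip TEXT, url TEXT, bytes INTEGER)")
      with open("hits.csv", newline="") as f:
          rows = ((r["ip"], r["url"], int(r["bytes"])) for r in csv.DictReader(f))
          conn.executemany("INSERT INTO hits VALUES (?, ?, ?)", rows)
      conn.commit()
      for row in conn.execute(
              "SELECT url, SUM(bytes) FROM hits GROUP BY url ORDER BY 2 DESC LIMIT 10"):
          print(row)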

------
Xcelerate
It always cracks me up what people consider "big data" to be. I once wrote a
few Julia scripts to process terabytes of data from the results of molecular
dynamics simulations. Just spin up a few high-memory cloud instances, process
for a couple hours, and then shut down. Total cost: about $12. It's amazing
what a couple billion CPU cycles per second can do. All without Hadoop, Spark,
etc.

------
pcsanwald
Something no one has mentioned directly: startups selling software to
Fortune 500 companies generally have to be very open about the architecture of
their software, how it all works, etc. Sales are always a "build vs buy"
decision with these companies.

If the architecture diagram looks complex, it makes for an easier sale, i.e.
"well, clearly that would take us forever to build/integrate/whatever, so
buying seems easier", whereas if it's just a database and an app server, it
can be a much harder sale.

Obviously all the usual caveats apply, but in general, demonstrating
"technical complexity" to the customer is something a lot of sales teams and
CEOs will push for.

------
zitterbewegung
I'm not sure how valid the last point is (my data is 5 TB or bigger) when you
can create a 2 TB (memory) x1 instance: [https://aws.amazon.com/ec2/instance-
types/#memory-optimized](https://aws.amazon.com/ec2/instance-types/#memory-
optimized). Also, someone else mentioned using Spark's Catalyst engine and
DataFrames data structure. In my previous experience with Spark, it would
automatically spill data to disk, so it would be feasible to work with 5 TB of data.

~~~
vidarh
The article is 4 years old. Scale numbers accordingly. 5TB of data today can
fit in RAM on a single server.

------
vgt
Shameless pitch:

Google BigQuery has a 10GB free storage tier and a 1TB per month query tier.

If your data is small enough, you get a fully managed "serverless" analytics
database for very little, potentially free.

(work at G)

------
betolink
Previously on HN:
[https://news.ycombinator.com/item?id=6398650](https://news.ycombinator.com/item?id=6398650)

------
Tycho
How come the author, HN user yummyfajitas, has not posted for over a month?
Was he banned?

~~~
jaredtobin
Just conjecture, but I imagine that he got tired of the mods accusing him of
trolling (in his uniformly high-quality and civil commentary) due to his non-
standard political views.

------
chimerasaurus
This is still not an uncommon problem.

Interestingly, we (Cloud Dataproc team) have been trying to work in the
opposite direction. A few months ago we launched single-node clusters so
people can use Spark on one VM instead of creating crazy-huge clusters just
for lightweight data processing. No sense in using a a fleet of cores for
simple stuff. :)

Disclaimer - work for Google Cloud (and think we need better tools for simple
data processing)

------
philippeback
If one has tons of workloads (like in hundreds to thousands) to run
concurrently on a lot of the data, Hadoop is pretty solid.

YARN is also allowing interesting stuff.

------
luhn
AWS is launching instances from 4 to 16TB of memory in the coming year:
[https://aws.amazon.com/blogs/aws/ec2-in-memory-processing-
up...](https://aws.amazon.com/blogs/aws/ec2-in-memory-processing-update-
instances-with-4-to-16-tb-of-memory-scale-out-sap-hana-to-34-tb/)

Even terabytes of data isn't "big" anymore.

~~~
skj
In the era of "big data", TB was never considered big.

~~~
metaphorm
it's big enough that your conventional RDBMS and app server stack starts to
struggle with it

------
dis-sys
A 2016 work project of mine required crawling 100 million web pages from around
100k different web sites each night. There was no existing infrastructure
whatsoever. I was told that it was a big data project and that internet-scale
processing power was required (and actually budgeted for).

I built a Python crawler to pull data from the Internet using EC2 spot
instances, as they are dirt cheap. After just a few days of happy coding, I
got to a place where I could reliably (spot instances are not that reliable)
pull the data, compress it and download it to our own data centre for
processing (a policy rather than a design decision), all for under $50, with
the majority spent on transferring the final compressed output.
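
The per-instance worker itself can be tiny; a rough sketch of the fetch-and-compress part (the URL input and output layout are hypothetical, and the real version needs retries because spot instances die):

      import gzip
      import sys
      import requests

      # Read URLs on stdin, fetch each page, and append the bodies to one
      # compressed file for cheap transfer back to the data centre.
      with gzip.open(sys.argv[1], "wt", encoding="utf-8") as out:
          for line in sys.stdin:
              url = line.strip()
              resp = requests.get(url, timeout=30)
              out.write(url + "\t" + resp.text.replace("\n", " ").replace("\t", " ") + "\n")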

Big data? 100 million small files a run is a typical small data project; big
data means trillions of files a day, or something like exabyte-scale data
volume.

~~~
speedplane
You're defining the term big data as "an amount of data that seems impressive
to me". That's a pretty loose definition, and two people could have wildly
different ideas.

I prefer to define big data as "an amount of data that cannot fit on any one
computer." Once you start using distributed systems, either with sharding,
NoSQL, or similar tech, you're in big data territory. Going from one computer
to two greatly increases the complexity of the system.

------
pcarolan
More and more I reach for pyspark for medium-sized tasks, because I like the
API and if I need to scale it I can. I don't think this is as black and white
as it used to be, now that big data framework interfaces are getting as good
as or better than small data interfaces.

------
StanAngeloff
We fell into this "trap" as well. Whilst working on a marketing automation
system, we were integrating a Google Analytics/Piwik clone. Our guesstimates
indicated we were going to be storing around 100 GB of events per month. We
geared up and got working. The team built complex Hadoop jobs, Pig & Sqoop
scripts, lots of high-level abstractions to make writing jobs easier, lots of
infrastructure, etc. etc. After about 2 months we scrapped the "big data" idea
and redid everything in two weeks using PostgreSQL. As most of the queries
were by date, partitioning made a huge difference.

I recall one of the classes was named SimpleJobDescriptor. Near the end it
was 500+ lines long. Not so simple after all.

------
in9
I didn't even read the article, but I agree with the title already. The number
of companies asking for Hadoop knowledge for data only slightly bigger than
the Excel limit is not trivial... At least in the area where I work, this is
the case.

~~~
cortesoft
I hate these sorts of titles, because some of us DO work with petabyte scale
datasets. I know the article does mention the types of use cases that Hadoop
is good for, but the title just comes across as arrogantly dismissive.

~~~
quickben
And we cheer for you guys. We really do. You push the edge, it's the right
thing to do.

Some of us though, from time to time, have these side company projects. Then,
after a round of consultants and meetings (and nobody asking us for input), we
had this 500 MB database, and an architect who was pushing for all the big data
keywords to be on his resume, even though all the math and business numbers
said we'd double that in _twenty years_.

We are talking ZooKeepers and 20 servers per geographically separate
datacenter, Spark, ML, Cassandra, Hadoop, MR, Java everywhere, Logstash, etc.
etc. etc. Presentations hundreds of PowerPoint slides long and endless
meetings about direction and future.

So the article, and MySQL, have their places. Especially when one can finish
the job in two weeks.

~~~
cortesoft
Of course, I am not saying that there aren't people who misuse 'big data'
technology for 'small data' problems. You could even keep a provocative title
like, "Are you sure you need big data?" or, "Your data must be this big to
ride" or something else both cheeky and not dismissive.

------
barrkel
I've been around this loop over the past few months. Something to bear in
mind: Hadoop of 2017 is quite different to the Hadoop of 2013.

Our data isn't enormous, by any means - 160G in one particular instance that's
being used for a proof of concept; it'll add up to 5-20T should it reach
production. The catch is that it's 160G in MySQL; it's only 5G or so once it's
been boiled down to Parquet files in HDFS. Columnar stores can be a really big
win, depending on the shape of your data.

We use Impala for our queries. It's quite good tech; it's much faster at table
scans than everything that doesn't describe itself as an in-memory database.
That means writing SQL much like you would with Hive, only it runs faster.

I tried out both Citus and Greenplum to give PostgreSQL a fair shot. First
problem: PostgreSQL is limited to 1600 columns in a table, and the column
limit for a select clause isn't much bigger. We have several times this number
of columns in our largest analytic tables. Not the end of the world, you can
cobble things together from joins and more special-purpose tables.

Second problem: CitusDB doesn't come OOTB with a column store, and it's far
too slow when using a row store. I didn't bother trying to compile the column
store extension to use with Citus; the pain ruled itself out. I continued
ahead with Greenplum, focusing on columnar storage - row storage is
consistently poor.

Third problem: Greenplum is cobbled together from a pile of duct tape, an
assembly of scripts and ssh keys to keep the cluster in sync. It does not
inspire the same kind of confidence for operational management as HDFS,
whether rebalancing the cluster, expanding the cluster, or decommissioning
nodes (not supported with Greenplum, AFAICT).

Fourth problem: Impala simply runs faster than the Postgres derivatives, and
its lead increases the more data you have. Impala seems to do table scans over
twice as fast on identical deployment environments.

Indexes only help when the operation being performed can use the index. As it
happens, most analytic queries do full scans, or have predicates that are
either not very selective (randomly skipping rows here and there) or are
really selective (date bucketing, which maps well to typical Hadoop
partitioning strategies). I had some hope that indexes would help for joins;
but Greenplum didn't elect to use my indexes, and when I forced their use, it
ran slower. The ancient version of Postgres that Greenplum is forked from
doesn't help much either, since it can't e.g. use covering indexes to avoid
looking back to the table.

If it was my startup, I'd take a risk on something like MonetDB, or look
harder at MemSQL, given what I've seen about how data has shrunk with column
stores. But from what I've seen and measured, Postgres doesn't really cut it
for analytic queries.

~~~
YCode
> PostgreSQL is limited to 1600 columns in a table, and the column limit for a
> select clause isn't much bigger. We have several times this number of
> columns in our largest analytic tables.

Is that typical for this kind of work?

~~~
barrkel
What's "this kind of work"? :) The company I work for does reconciliation as a
service, something that's pervasive in finance. We support customer-defined
schemas; in fact that's one of our selling-points. I could start talking about
the kinds of things we've done with MySQL to make this perform well - it's not
difficult, just a bit unorthodox.

Anyway, the diversity in customer schema leaks out into the Hadoop schema,
where we'd much prefer to give customers data using column names they're
familiar with, and we also want to give them rows from all their different
schemas in a single table (because many schemas have overlap by design). The
superset of all schema columns is large, however. The problem can be overcome
with more tooling - defining friendly views with explicit column choice - but
having the option to implement that (and go to market sooner), vs a
requirement to implement that, adds up to a distinct advantage for tech that
can support the extra columns.

~~~
jamespo
Edgar F. Codd just started spinning

~~~
barrkel
Something you need to bear in mind is that distributed joins are very
expensive; you have a better time designing your schema such that related data
can be placed logically close together, whether it's arrays / maps inside rows
(for one to many), or very wide rows (denormalizing what might be a star
schema).

(I know, in a column store having related data in another column isn't
actually close together; but it can be stepped through at the same time, it
doesn't need a join to be correlated, it's correlated naturally.)

------
yeukhon
This reminds me of
[https://news.ycombinator.com/item?id=8908462](https://news.ycombinator.com/item?id=8908462).
Use command-line tools when applicable.

------
Arcsech
I agree with the sentiment of the article, but why does this site ask for
notification permissions? It's a blog - if I want notifications about new
posts I'll add your RSS feed to my reader.

------
lars_francke
I'm biased in that I've been doing nothing but Hadoop consultancy for almost
10 years now.

I have done maybe a hundred to two hundred consulting projects based on Hadoop
(as you can imagine, it's mostly short-term and troubleshooting stuff) and am
an ASF committer myself. So take this with a grain of salt.

I've seen the kinds of projects the article and lots of people in here refer
to and I agree with lots of what's been said.

I also agree that Hadoop (the whole ecosystem) probably does not have a single
"best of category" tool that ticks all the checkboxes, but IMO that's totally
missing the point.

What it gives you is:

\- (Mostly) open source stack (compare that to the Informatica/Talend/IBM
stack that big companies have, and not to the Python jobs they don't); lots of
the code is crap (see my Apache JIRA history) but it's there, and you can
attach a debugger to your running cluster, which is fantastic and very hard to
do with e.g. Python (at least I'm not able to find good ways of doing it)

\- Proven technology: it works most of the time, and if not, there are tons of
SO answers, mailing lists, and books.

\- All the boring enterprise stuff: Encryption at rest and on disk, strong
authentication, strong authorization, auditing, high availability, failover
etc.

\- Integration into Monitoring & alerting tools, distributions figure out the
important stats and thresholds for you

\- Mostly easy to operate these days even at large scale thanks to Cloudera
Manager et al. - no need to manually run/install stuff

\- "Coolness factor"

\- Costs way less than established tools with e.g. external storage filers or
similar specialized hardware

\- It's not trivial but also not super hard to find people with Hadoop/Spark
experience

\- Existing purchasing agreements with Hardware vendors or Cloud providers

\- Business/C-level gets and supports it

Edit: Three more I thought of:

\- You can buy licenses and support for the tools. Both of which have been
discussed multiple times here on HN, a donation doesn't work that well for
companies

\- You can buy indemnification from license problems from at least Cloudera
but I think Hortonworks as well

\- If the same stack has been used elsewhere in the company already chances
are there are processes in place, someone has already vetted it, done the open
source checks etc.

See, what I don't mention are things like performance, absolute money amounts,
or even the amount of data (as the whole ecosystem is much more than "big
data" now), as we see surprisingly few projects that care about specific SLAs
or features. It's all relative.

That said: At least 50% of the projects would still be better served by a
simpler solution but that still leaves a whole big chunk of projects where
Hadoop makes sense for other than technical or monetary reasons.

But I very much disagree with this part of the post:

> The only benefit to using Hadoop is scaling.
------
coding123
I agreed with the article at the time it was posted. Now I use Spark for
various sized data sets. Specifically Zeppelin + Spark is a great combination.

Then again, Spark doesn't really need Hadoop, I see more and more people using
it with Kafka and Elasticsearch for all sorts of fun stuff.

And as other commenters pointed out, you get read-only SQL (very powerful SQL)
for free. The other day I joined an elasticsearch result with a CSV file in
SQL.

------
rbanffy
In a world where you can buy a 24 TB RAM, 8-socket, 192-core, 384-thread box
able to hold 20 2.5" devices and a mix of 16 devices picked from NVMe storage,
GPUs or other coprocessors (up to 16 coprocessor hosts, each with 72 x86
cores, giving you 2304 threads and 256 GB of RAM), the odds of you actually
having big data (as in "intractable from a single box") are remarkably small.
------
hkmurakami
I still enjoy remembering a friend who works at one of the well-known big data
companies telling me "our team works on tools for medium data".
------
EternalData
I feel like there should be a general law or rule of sorts -- don't use
technology for the sake of using technology.

------
usgroup
I'd add that the benefits of avoiding big data(tm) can extend further into
the 5TB+ space if you're happy to run at a bit of a delay.

I.e. if it's OK for the crunching to take 6 hours to produce intermediary
aggregates which can then be crunched in less time, you can avoid Spark and
Hadoop for much longer.

------
code4tee
Nice article and it still holds despite being a few years old. There's a
general lack of understanding or appreciation for this "medium" data world
which is actually where most of "big" data lives (i.e. too big for Excel but
not PB or many TB).

------
webmonkeyuk
There's a very similar argument in this article that's trended on HN a couple
of times before:
[https://news.ycombinator.com/item?id=12472905](https://news.ycombinator.com/item?id=12472905)

------
sevensor
Yup, 600 MB is pretty tiny. I have a netbook from 2011 that I used for
crunching numbers for my dissertation. It can handle 600 MB CSV files in memory
using numpy. The slowest part is reading from disk.
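
For scale, the whole thing is roughly this (a sketch; the file name and delimiter are placeholders):

      import numpy as np

      # 600 MB of CSV fits comfortably in memory even on old hardware.
      data = np.loadtxt("measurements.csv", delimiter=",", skiprows=1)
      print(data.shape, data.mean(axis=0))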

~~~
pvdebbe
The craziest example I saw was a blog post titled something like "How I used
big data to analyse my problem", and she was talking about 1200 rows. Big data
that could fit on a floppy disk!

Buzzwords are a hell of a drug.

------
rjurney
Don't use Hadoop even if your data is big. Spark replaced it.

~~~
threeseed
What are you talking about ?

Hadoop is basically HDFS + YARN. Spark is a replacement for MapReduce only.

~~~
rjurney
Except cloud computing. HDFS is only useful if you own hardware. HDFS is still
useful for these workloads, but calling it "Hadoop" when you are using "Spark"
as the execution engine doesn't make sense. You are using Spark, which depends
on HDFS for local installs only. And if you can be in the cloud, you should.

It is also worth noting that many people use Spark against a database like
Cassandra and not HDFS. So it isn't even universal for local installs.

I was a Hadoop evangelist, but its time has passed. It is a foundation, not
the tool you use to get work done.

------
jcoffland
> Don't use Hadoop when your data isn't that big

Did we really have to weaken the title's statement? Bending to the common
denominator will always lead to milquetoast titles.

------
abalashov
In my experience, there are not many problems in the course of everyday
business that cannot be solved with SQLite's CSV import feature and/or its
virtual tables.

------
mirekrusin
> Too big for Excel is not "Big Data"

This. Should be sold as posters with pre-printed "Happy birthday mr manager".

------
jakupovic
"There is no computation you can write in Hadoop which you cannot write more
easily in either SQL, or with a simple Python script that scans your files."

How about base64decode? It is possible in SQL but not pretty. There are other
examples as well. I do agree that Hadoop is probably overkill for 90%+ of
uses, but there is a certain set of problems which are helped by having a
general execution environment.

~~~
vidarh
You conveniently ignored the "or with a simple Python script that scans your
files" part.
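
For the base64 case, that "simple Python script that scans your files" really is about this long (the directory layout and record format are invented for illustration):

      import base64
      import glob

      # Minimal sketch: walk the input files and decode one base64 field per line.
      for path in glob.glob("data/*.tsv"):
          with open(path) as f:
              for line in f:
                  key, payload = line.rstrip("\n").split("\t", 1)
                  print(key, base64.b64decode(payload).decode("utf-8", "replace"))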

------
synaesthesisx
For 95% of applications Hadoop is overkill

------
ianamartin
The thing that's so frustrating to me--and many others, based on the
comments--is how much of this gets pushed to management and product people.

Our team just acquired a product manager for the first time. And for the most
part I absolutely adore her. She's really great at pacing things out to a
timeline that works for us, and at pulling us out of our idea that a hacked-up
solution you do in a couple of days is really a product. And she's
very much pushing against putting those into production, all of which I agree
with.

But she doesn't understand technology or needs. She's constantly having
meetings where she pitches our manager on product ideas that don't even make
sense compared to the client's needs. Many of these are buzzword-based. But it
makes me seem like an asshole when, in every team meeting, I have to be like,
"No. That's not what the client asked for. The client asked for a solution to
this problem x." The client (doesn't matter if it's internal or external) often
doesn't know what the right solution is. Good product people and managers know
that to get to a good place, the key is to get the client to describe the
problem instead of the solution.

I'm not a product person. I don't know jack dick about products except for how
to make them. But the level of miscommunication I see even in our small
organization is astounding.

You get one manager, one product manager, and two stakeholders, and one
engineer in a meeting together. It's absolutely unbelievable to me what
management and product take away from those meetings.

It's so far from the reality of the problem to be solved that it sometimes
makes me nuts. Like it has today.

I think that our whole way of doing things and our hierarchy is completely
broken and backwards. You want to get something done? Put a software engineer
in a room with an operations person who is on the ground. Talk about the
problem. Propose a solution.

Take it to the management and let them sort out priorities among themselves.
This bullshit about management being the first point of contact virtually
guarantees that buzzword-based development is going to happen.

I know that we are supposed to be the sacred cows, and that we need to be
protected. Fuck that, I say. Let me interact with the people on the ground
whose problems I'm trying to solve. Don't put a non-technical product person
who doesn't get any of the details right between me and the end-user. I'll
take the time out of my day two days per week to go to the office and have
meetings, and I do the rest at home where I can be productive and write code.

Sorry for the rant, but fuck all of this. We have a very broken system, in
most cases.

------
shriphani
Hadoop's carbon footprint should make it one of the top threats to
civilization.

------
frozenport
Looking at the comments, one valid reason a company might be using big data
without actually having big data is the potential need to scale and grow.
Don't forget that many of them are expecting a 10x increase in users.

~~~
jononor
Don't forget that most of them never get there. Often because they are not
able to find & make the product which the market wants/needs. Sometimes partly
because engineering is too slow and unadaptive, possibly because they are
overcomplicating things unnecessarily.

------
PLenz
I don't know, 21PBs seems like a lot to me....

~~~
YCode
From the article:

> But my data is more than 5TB! Your life now sucks - you are stuck with
> Hadoop. You don't have many other choices (big servers with many hard drives
> might still be in play), and most of your other choices are considerably
> more expensive.

------
musgrove
Data are plural, e.g. "Data aren't that big." I believe datum is singular.

~~~
ocb
Not true in the vast majority of usages. Maybe in scientific writing, but this
is a blog post.

