

Don't use Hadoop – your data isn't that big (2013) - isp
https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html

======
jasode
_> If you have a single table containing many terabytes of data, Hadoop might
be a good option for running full table scans on it. _

The author mentions this at the end almost as a footnote but in my experience,
this is _usually_ the motivating reason for IT to use Hadoop. It's about
processing _throughput_ and not just absolute data size. A lot of processing
jobs can't use indexes (except for lookup tables) so a PostgreSQL 4TB db with
a schema for OLTP workloads doesn't solve the issue.

A 4TB hard drive might be cheap at $150, but at 100MB/sec it takes 11+ hours
to read end to end. However, a big data cluster with 10 nodes can each scan
its local 400GB shard, and the batch job gets done in about an hour. Even
newer 10TB mechanical hard drives at 150MB/sec (or multi-disk RAID saturating
a SATA3 interface) aren't going to fundamentally alter this throughput
equation. On the other hand, a 4TB SSD at 1GB/sec might be the sweet spot if
the processing is not cpu-constrained. But then again, new technology moves
the goal posts[1], and people will invent new workloads that require a
100-node cluster, each node with a 10TB SSD, to solve bigger problems.

[1][http://en.wikipedia.org/wiki/Induced_demand](http://en.wikipedia.org/wiki/Induced_demand)
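For what it's worth, the scan-time arithmetic above can be checked in a few lines of Python (the MB/sec figures are the round numbers assumed in the comment, not benchmarks):

```python
# Rough full-scan time arithmetic, using the round numbers assumed
# above (100MB/s mechanical disk, 1GB/s SSD) -- not real benchmarks.

def scan_hours(total_tb, mb_per_sec, nodes=1):
    """Hours to scan total_tb terabytes at mb_per_sec per node,
    with the data split evenly across `nodes` machines."""
    total_mb = total_tb * 1024 * 1024
    seconds = total_mb / (mb_per_sec * nodes)
    return seconds / 3600

print(round(scan_hours(4, 100), 1))            # single 4TB disk at 100MB/s -> 11.7 hours
print(round(scan_hours(4, 100, nodes=10), 1))  # same data on a 10-node cluster -> 1.2 hours
print(round(scan_hours(4, 1000), 1))           # one 4TB SSD at 1GB/s -> 1.2 hours
```

Which is the point: ten cheap spindles or one fast SSD get you to roughly the same place.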

~~~
bsg75
> A lot of processing jobs can't use indexes (except for lookup tables) so a
> PostgreSQL 4TB db with a schema for OLTP workloads doesn't solve the issue.

What are the cases where an RDBMS can't use indexes on fact tables?

~~~
sokoloff
Searching within fields is one example.

WHERE street_address LIKE "%MAIN%ST%"

WHERE xml_blob LIKE "%<deprecated_feature_node_name%"

to find documents that need to be updated when you remove that feature, etc.

"Just use Mongo" or "just use the latest pgsql and convert to JSON" or "just
do XYZ" is often a less practical answer than "just query/analyze the database
you already have".

~~~
usetrigram
If you're using PostgreSQL you can just use a trigram index for these kinds of
searches.

\-
[http://www.postgresql.org/docs/current/interactive/pgtrgm.html#AEN150325](http://www.postgresql.org/docs/current/interactive/pgtrgm.html#AEN150325)

\-
[https://swtch.com/~rsc/regexp/regexp4.html](https://swtch.com/~rsc/regexp/regexp4.html)
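To see why trigrams can serve a `%MAIN%ST%`-style predicate, here's a toy Python sketch of the underlying idea (a conceptual illustration only, not how pg_trgm is actually implemented): index every 3-character substring of each value, intersect the posting lists for the pattern's trigrams, then verify candidates.

```python
# Toy illustration of the idea behind a trigram index (pg_trgm):
# index each value by its 3-character substrings, then answer a
# substring query by intersecting posting lists and verifying.
from collections import defaultdict

def trigrams(s):
    s = s.upper()
    return {s[i:i + 3] for i in range(len(s) - 2)}

class TrigramIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # trigram -> set of row ids
        self.rows = []

    def add(self, value):
        rowid = len(self.rows)
        self.rows.append(value)
        for t in trigrams(value):
            self.postings[t].add(rowid)

    def search(self, substring):
        grams = trigrams(substring)
        if not grams:  # pattern too short for trigrams: scan everything
            candidates = set(range(len(self.rows)))
        else:
            # Candidate rows must contain every trigram of the pattern...
            candidates = set.intersection(
                *(self.postings.get(t, set()) for t in grams))
        # ...and each candidate is then verified, so no false positives.
        return [self.rows[r] for r in sorted(candidates)
                if substring.upper() in self.rows[r].upper()]

idx = TrigramIndex()
for addr in ["12 MAIN ST", "99 ELM AVE", "7 MAINSAIL DR"]:
    idx.add(addr)
print(idx.search("MAIN"))  # -> ['12 MAIN ST', '7 MAINSAIL DR']
```

The win over a leading-wildcard LIKE is that only rows sharing the pattern's trigrams are ever touched, instead of every row in the table.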

~~~
sokoloff
LOL at the new account name, but thank you very much for the pointer to an
index type I didn't know existed!

------
isp
Previous comments (2013):
[https://news.ycombinator.com/item?id=6398650](https://news.ycombinator.com/item?id=6398650)

Also worth a read, "Command-line tools can be 235x faster than your Hadoop
cluster (2014)":
[http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html](http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html) |
[https://news.ycombinator.com/item?id=8908462](https://news.ycombinator.com/item?id=8908462)

------
lemmsjid
At my company, we operate on very large amounts of data.

Let's pretend, though, that we woke up this morning and had a 600mb dataset.

We'd still have hundreds of jobs running on that 600mb dataset. Most of them
would ideally run as often as possible--we'd only schedule them periodically
because of performance. Some of them would explode the 600mb dataset into
several gigs. Most of them would copy the dataset several times over. Some of
them would be memory intensive. Some of them would be CPU intensive. Some of
them would be disk intensive. Many of those jobs emit counters, logs, and
metrics, which we want to track over time. We'd want to track the overall
resource utilization of those jobs. We'd want the particular jobs to be killed
automatically if they use too many resources. If the jobs are in danger of
taking up too many resources in the system, we want them to be automatically
queued and scheduled piecemeal. We'd want the jobs to be written in a variety
of frameworks targeted to the particular task at hand.

Given those hundreds of jobs competing for scarce resources, we'd want to not
have to completely rewrite the system if the 600mb dataset became a 1 gig
dataset, or a 1 tb dataset.

Sure, we could put together a multi-server ssh/bash job scheduling system, but
in the end that's kind of what hadoop is, except that it has already
addressed a lot of problems you don't even know yet that you're going to have.

~~~
lmm
With a 600mb dataset you probably don't need batch jobs. You can afford to run
ad-hoc reports when they're requested. You can process data as it arrives,
streaming it into a structured representation in postgres or similar. You can
afford to stop the pipeline when you get an item whose format doesn't match,
and investigate, rather than having to tolerate a certain error percentage.
You don't need a scheduling framework because if you ever have too many jobs
running you can kill a few, you're not going to lose hours of work by doing
so. You want metrics but IME hadoop isn't actually very good at that.

If you're doing enough processing that you can't do it on one beefy database
server then sure, the concerns that go with that are similar to when you have
enough data that you can't handle it on one beefy database server. But most
people with 600mb of data don't want or need to do multiple cpu-hours worth of
processing on it.

------
leeleelee
A lot of people probably run into problems from poor design, poor programming,
or just inability to problem solve. For example (two real examples I've
faced):

(1) Company xyz is having problems generating reports that pull from an sql
database. They take too long, IT says it's because the data is too large and
their hardware is old. Obviously this is big data! Wrong, re-writing the sql
queries and indexing some fields solved the problem. Like, beginner-level
stuff. Nobody had even looked at the code or structure of the database when
initially trying to solve the problem.

or

(2) Company xyz can't load a dataset into memory to perform some computations
on it; it's too large. Needs better hardware or a big data solution,
obviously. With millions of rows and many columns, OK. But only 6 of those
columns were even used by the processing job, and only a fraction of the rows
were relevant. There was no need to fit the entire dataset into memory in the
first place. Problem solved.
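The fix in case (2) can be sketched in a few lines: stream the file and touch only the columns and rows the job actually uses, so memory stays flat regardless of dataset size (the column names and the aggregation here are made up for illustration):

```python
# Stream a CSV and keep only the needed columns/rows instead of
# loading the whole dataset into memory. Column names ("region",
# "amount") and the aggregation are hypothetical.
import csv
import io

def total_by_region(csv_file, wanted_region):
    total = 0.0
    for row in csv.DictReader(csv_file):
        if row["region"] != wanted_region:  # skip irrelevant rows
            continue
        total += float(row["amount"])       # only touch the columns used
    return total

data = io.StringIO(
    "region,amount,notes\n"
    "east,10.5,foo\n"
    "west,3.0,bar\n"
    "east,2.5,baz\n"
)
print(total_by_region(data, "east"))  # -> 13.0
```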

I could go on. Lots of companies just lack the skills and knowledge to do
things right, and have no idea what's going on under the hood or how things
work. So naturally, they are drawn to the marketing buzz surrounding big data,
cloud computing, data warehousing, machine learning, etc etc etc.

~~~
polsoul
Having been a DBA for a decade already, resolving issues like the ones you
described has been among the sweetest work of my professional career.
Unfortunately, I don't believe that fixing the problem in the database, by
understanding it and resolving it in a logical manner, is the way things will
be done in the near future. You know, RDBMSs aren't sexy ... developers
developers developers, web developers web developers web developers :) The
newest API and/or language is much more important nowadays for the newcomers.
I can't believe it, it's beyond me, but that's the reality. No one wants to
utilize the database to its fullest anymore; everything needs to be done
across a gazillion app servers using java/.net or the trendy 3rd/4th
generation language of the day.

My only hope is that the data is the one thing that is not going away, and
this will force us to think about it in a logical way (eventually devs/web
devs will start doing data manipulation etc. the right way; I guess you're
pretty familiar with the approach I have in mind).

Thanks

~~~
nostrademons
I remember c. 2002-2005, it was RDBMSes that were the overused, overhyped
solution. Basically every webapp had to be backed by a database. Everybody was
moving to 3-tier architectures (even for things as simple as a TODO list),
frameworks like Struts and Rails and Symfony and Django all sprang up to
manage the complexity of talking to the DB, and you'd see all these tips on
blogs (themselves among the first database-backed webapps, even though they
worked perfectly well as flat files) about how to normalize your schema, shard
your data for scale, pick ID spaces, and make your joins perform well.

Meanwhile, a small minority of people were like "Guys? GUYS! You can just use
flat files and in-memory data structures for that. Why bother with a database
when 3 hashtables and a list will do the same thing several orders of
magnitude faster." They had outsized accomplishments relative to their
shrillness, though: Yahoo, Google, PlentyOfFish, Mailinator, ViaWeb, Hacker
News were all built primarily with in-memory data structures and flat files.

These things run in cycles. The meta-lesson is to let your problem dictate
your technology choice instead of having your technology choice dictate your
problem, and build for the problem you have _now_ rather than worry about the
problems that other people have. For many apps, a hashtable is still the right
solution to get a v0.1 off the ground, and it will scale to hundreds of
thousands of users.

~~~
polsoul
Fully agree with you - there are problems and problems, and 'use the right
tool for the right job' should always be the way to go.

Hashtables or flat files, huh? Hard to believe that's going to be useful as a
'data storage engine' (so to say) for a non-trivial app. You will have to
implement everything yourself:

\- concurrency: several users updating one and the same entry

\- data consistency: returning the data as of the start of the 'select' for
one user; what about others who updated the entry but still haven't
committed?

\- security: fine-grained access to specific data entries (what can be
extracted by whom), plus auditing (who did what, at what time, etc.)

\- backup/recovery, etc.

More code most of the time means more bugs. Not everyone is Google , Amazon or
Facebook to have the resources to create good custom solutions and support,
improve them.

For most cases, a simple rdbms database will provide you with enough
'standard' functionality (think security, concurrency, consistency, etc.)
that you don't have to reinvent the wheel and code all of these on your own
from scratch.

But in the end, it's again the nature of the problem (whether you'll need
security, concurrency, or consistency at all). It's good to know what the
database as a tool gives you, and I don't think that's the case nowadays -
too many java/.net and other developers have no idea what they can get out of
a database. And I really hate it when someone spends several weeks writing
something that could be done in several hours by using a specific database
feature.

~~~
nostrademons
Most of the time, you can ignore all those concerns when you're starting out.
Just run a single-threaded webserver and make one HTTP request = one
transaction. You can keep your registered users in memory and store pointers
to them in a list as your ACL, and log audit trails as a linked list. Snapshot
data to a file periodically for crash resistance, and load it back in at
system start.
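A minimal sketch of that style, assuming JSON snapshots and an explicit snapshot() call rather than a background timer (all names here are illustrative, not anyone's actual code):

```python
# Minimal "all state in memory, snapshot to disk" sketch of the
# architecture described above. A real service would snapshot on a
# timer; here it's an explicit call.
import json
import os
import tempfile

class InMemoryStore:
    def __init__(self, snapshot_path):
        self.snapshot_path = snapshot_path
        self.users = {}       # username -> profile dict
        self.audit_log = []   # append-only list of events

    def register(self, username, email):
        self.users[username] = {"email": email}
        self.audit_log.append(("register", username))

    def snapshot(self):
        # Write to a temp file, then atomically rename, so a crash
        # mid-write never corrupts the last good snapshot.
        state = {"users": self.users, "audit_log": self.audit_log}
        fd, tmp = tempfile.mkstemp(
            dir=os.path.dirname(self.snapshot_path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp, self.snapshot_path)

    @classmethod
    def load(cls, snapshot_path):
        store = cls(snapshot_path)
        if os.path.exists(snapshot_path):
            with open(snapshot_path) as f:
                state = json.load(f)
            store.users = state["users"]
            store.audit_log = [tuple(e) for e in state["audit_log"]]
        return store

store = InMemoryStore("snapshot.json")
store.register("alice", "a@example.com")
store.snapshot()
restored = InMemoryStore.load("snapshot.json")
print(restored.users)  # -> {'alice': {'email': 'a@example.com'}}
```

The atomic-rename trick is what gives you the "occasional data loss back to the last snapshot, but never a corrupt file" failure mode.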

Sure, you won't be able to run on more than one core, and this won't work if
your data exceeds RAM. And I wouldn't do this for anything mission-critical;
it's better for the types of free or cheap services where users will tolerate
occasional downtime or data loss. But if you never hit the disk and never need
to make external connections to databases or the network, you can easily do
upwards of 10K req/sec on a single core these days, even in a scripting
language like Python. Assuming you've got 10,000 simultaneous actives (this
is more than e.g. r/thebutton and most websites, and usually translates to
around ~1M registered users), that's one req/user/sec, which is pretty
generous engagement.

Hacker News is built with an architecture like this. There've been a couple
massive fails related to it (like when PG caused a crash-loop by live editing
the site's code and corrupting the saved data in the process), but by and
large it seems to work pretty well.

~~~
ntoshev
Yup, I'm actively using this architecture, in Python, with gevent (so that
calls to external services don't stall your web server). I spawn long running
computations in separate processes. I've described it here:

[http://www.underengineering.com/2014/05/22/DIY-
NoSql/](http://www.underengineering.com/2014/05/22/DIY-NoSql/)

It really achieves 10k simple requests/transactions per second on a single
core (or a $5/month VPS). The software support for it could be better,
though. I should really release some code that helps with things like
executing a function in another process, or verifying that some code executes
atomically.
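The "spawn long-running computations in separate processes" part can be sketched with the stdlib alone (this uses concurrent.futures rather than the author's gevent setup, and the job itself is a hypothetical stand-in):

```python
# Offloading a CPU-bound computation to a separate process so a
# single-threaded web server stays responsive. Sketch of the pattern,
# not the author's actual code.
from concurrent.futures import ProcessPoolExecutor

def heavy_report(n):
    # Stand-in for a long-running job that would stall the event loop.
    return sum(i * i for i in range(n))

def main():
    with ProcessPoolExecutor(max_workers=2) as pool:
        future = pool.submit(heavy_report, 10_000)
        # A request handler could return immediately and poll the
        # future; here we block only when the result is needed.
        return future.result()

if __name__ == "__main__":
    print(main())
```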

------
foobarge
Contains the famous ``Too big for Excel is not "Big Data".''

------
gesman
...but resumes look better after that :)

~~~
nimrody
So do job / company descriptions. A company has to be doing Machine Learning,
Big Data, Cloud computing these days to be "buzzword compliant" :)

~~~
gesman
Good to know, as I'm about to be interviewed by one like that in SF :)

------
fs111
Pick a technology that can grow with your dataset. Take a look at Cascading or
if you like Scala, scalding. They work on your laptop and on a 2000 node
cluster the same way.

~~~
kod
I honestly don't know why anyone would use Cascading or Scalding when Spark
exists.

~~~
evancasey
Huge spark fan here. Love the execution model, API, supporting libs etc.

Unfortunately, Spark doesn't scale well on large datasets (10TB+). Sure, it's
possible (and has been done), but right now there are too many rough edges to
make it a better choice than Scalding/Cascading for data processing at scale.
Most of this boils down to fine tuning certain Spark parameters, which is a
pain when you're dealing with long-running, resource intensive workflows.

~~~
kod
You have any references to support that 10tb claim?

------
baldfat
Look at R and realize your data can fit in RAM. Almost every example dataset
used for clustered computing could really fit inside a single computer's
memory.

~~~
bagels
What servers support, say, 64tb of ram, let alone the petabytes that a much
larger company may have to deal with?

------
jowiar
Using "big data" technology is about more than just input size. If your
operations make the data bigger, you might well need a bigger tool. You can
start with a data set on the middle-to-high end of memory-sized, but if you
have an operation in there which combinatorially explodes the input data,
Spark is going to start looking pretty good.

------
michaelochurch
I could be wrong here, but my sense of Hadoop is that it's going to prove to
be a dead end in the long run. Granted, evolutionary dead ends in technology
can be very lucrative (look at Cobol, or Java). They aren't exciting, but
there's money in them. That said, the aesthetic compromises made to appeal to
the Java community, scaled out over years or decades, tend to make things
bloated and hard to use. I don't see people being excited to do Hadoop, or
wanting to learn it on their own time. It's much more interesting to learn the
fundamentals of distributed systems than to learn a specific set of Java
classes.

Personally, I'm finding myself increasingly burned out on "frameworks" that
appeal to out-of-touch corporate decision-makers (who are attracted to the
idea of a packaged "product" that solves everything) but tend to be
overfeatured while under-delivering (especially if time to learn how to use
the framework properly is included) when it counts. I'm much more interested
in the modularity that you see in, say, the Haskell community where
"framework" means something that would be a microframework anywhere else.

------
kf5jak
I upvoted this purely because of the title!

