
Billions wasted on Hadoop startups, the same will eventually be true of Docker - quicksilver03
http://www.smashcompany.com/technology/billions-were-wasted-on-hadoop-startups-and-the-same-will-eventually-be-true-of-docker
======
hardwaresofton
I am very much a fan of hot-takes, but this one is trash --

> The money was wasted on hype. The same will eventually be said of Docker.
> I’ve yet to hear a single benefit attributed to Docker that isn’t also true
> of other VMs, but standard VMs allow the use of standard operating systems
> that solved all the hard problems decades ago, whereas Docker is struggling
> to solve those problems today.

Linux containerization (using the word "docker" for everything isn't right
either) is an isolation + sandboxing mechanism, NOT a virtual machine. Even if
you talk about things like LXC (orchestrated by LXD), that's basically just
the addition of the user namespacing feature. A docker container is _not_ a
VM; it is a regular process, isolated with the use of cgroups and namespaces,
and possibly protected (like any other process) with selinux/apparmor/etc.
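
To make that concrete, here's a minimal sketch in Go (assuming Linux and
root; nothing here is Docker-specific, just the standard syscall constants)
that launches a shell as a plain process inside a few fresh namespaces:

    package main

    import (
        "os"
        "os/exec"
        "syscall"
    )

    // Run a shell in its own UTS, PID and mount namespaces. This is the
    // kernel mechanism underneath every container runtime; images,
    // networking and cgroup limits are layered on top of it.
    func main() {
        cmd := exec.Command("/bin/sh")
        cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
        cmd.SysProcAttr = &syscall.SysProcAttr{
            Cloneflags: syscall.CLONE_NEWUTS | // private hostname
                syscall.CLONE_NEWPID | // private PID tree
                syscall.CLONE_NEWNS, // private mount table
        }
        if err := cmd.Run(); err != nil {
            panic(err)
        }
    }

Inside that shell, `hostname foo` no longer touches the host, yet on the host
the whole thing shows up in `ps` like any other process -- no hypervisor
anywhere.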

Containerization is almost objectively a better way of running applications --
there's really only one question: do you want your process to be isolated or
not? All the other stuff (using Dockerfiles, pulling images, the ease of
running languages that require their own interpreters since you package the
filesystem) sits on top of this basic value proposition.

An easy way to tell that someone doesn't know what they're talking about when
speaking about containerization is if they call it a VM (and don't
qualify/note that they're being fast and loose with terminology).

All this said -- I do think Docker will die, and it _should_ die, because
Docker is no longer the only game in town for reasonably managing containers
(see: podman, crictl) or running them (see: containerd/cri-o, and
libcontainer, which turned into runc).

[EDIT] - I want to point out that I do _not_ mean Docker the company or
Docker the project will "die" -- they have done amazing things for the
community and for development as a whole that will literally go down in
history as a paradigm shift. What I should have written was that "docker <x>",
where x is "image", "container", "registry", etc., should be replaced by
"container <x>".

~~~
cmsj
I'm not going to support the general thesis of this article, but I want to
address something you said.

You're right that containers are not VMs, but that's only really relevant as
pedantry of technical details.

I think that what the author was trying to say (without really understanding
it) was a comparison of containers to VMs as units of software deployment.

I don't think anyone is credibly using containers as a security measure on
Linux, because if they think they are, they are in for several large
surprises.

Rather, we're seeing the unbundling of software - it used to be that you
deployed software to a physical machine with a full OS, then you could deploy
it to a virtual machine with a full OS, then you could deploy the process, its
dependencies and a minimal OS into a container.

I agree that Docker doesn't have a huge and profitable future ahead of it,
because it's providing commodity infrastructure. Rather, I think it's
interesting to think about what the next level of software deployment
decomposition will be, and I'd wager that it's FaaS (i.e. serverless).

~~~
vorpalhex
> You're right that containers are not VMs, but that's only really relevant as
> pedantry of technical details.

That isn't pedantry; it is an extremely critical point, and the one most
people miss when figuring out Docker -- both from a security standpoint
(Docker doesn't provide VM-level promises about isolation) and from a
resource management one (Docker has really close to zero overhead).

It is not uncommon to deploy containers on VMs in the real world...

~~~
hardwaresofton
This is precisely why I said the fact that someone would gloss over this is a
red flag. The point is _super critical_.

VMs are so hard to do correctly and in a performant fashion that parts of CPU
instruction sets[0] and kernel subsystems (KVM[1]) were created to make them
easier to run. Containers, in contrast, are literally a few flags and a bunch
of in-kernel antics.

A few people, notably Liz Rice and Jessie Frazelle, have given talks on how to
make containers from scratch that are _very_ illuminating for those who are
interested:

https://www.youtube.com/watch?v=HPuvDm8IC-4

https://www.youtube.com/watch?v=cYsVvV1aVss

[0]: https://en.wikipedia.org/wiki/X86_virtualization

[1]: https://en.wikipedia.org/wiki/Kernel-based_Virtual_Machine

~~~
panpanna
Containers are "easy" because they are backed by tons of kernel code (cgroups,
namespaces, and basically a small part of many other subsystems). You can
actually create containers from the shell!
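
For instance, a minimal sketch in Go of the cgroup half (assuming cgroup v2
mounted at /sys/fs/cgroup and root; the "demo" group name and the 100 MiB cap
are made up, but the files are the real kernel interface -- the shell version
is just mkdir and echo):

    package main

    import (
        "os"
        "path/filepath"
        "strconv"
    )

    // Creating a cgroup is nothing more than filesystem writes; the
    // kernel enforces the limits from then on.
    func main() {
        cg := "/sys/fs/cgroup/demo"
        if err := os.Mkdir(cg, 0755); err != nil && !os.IsExist(err) {
            panic(err)
        }
        // Cap memory for every process in this group at 100 MiB.
        write(filepath.Join(cg, "memory.max"), "104857600")
        // Move the current process into the group.
        write(filepath.Join(cg, "cgroup.procs"), strconv.Itoa(os.Getpid()))
    }

    func write(path, val string) {
        if err := os.WriteFile(path, []byte(val), 0644); err != nil {
            panic(err)
        }
    }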

VMs are "hard" because you start with nothing save some very low-level help
from the hardware.

~~~
hardwaresofton
Also, I want to be clear that I'm using "simple" and "easy" in the Rich Hickey
sense of the words, as in "simple" has more to do with what a thing is made
of, and "easy" has more to do with ease of use, taking available tooling,
familiarity, and context into account.

What I should have made clearer was that I think containers are both easier
_and_ simpler than VMs. There are fewer moving parts (I haven't looked,
but I assume less code), _and_ containers are easier to get started with than
VMs (set some flags on some syscalls versus make sure you buy the right CPU).

~~~
panpanna
Fair enough, but don't forget there are some extremely simple hypervisor
implementations out there too.

------
imtringued
I don't understand how these are comparable. Hadoop solved a hard problem that
nobody had. Docker solves a simple problem that everyone has. It would make
sense if you're talking about Kubernetes and using it to build hundreds of
microservices because it's currently in fashion. Whether you're using Docker,
Packer, Ansible, or whatever doesn't matter. They are all solutions to the
same problem, and saying one is better basically boils down to saying which
brand of hammer is better.

~~~
pytester
>Docker solves a simple problem that everyone has.

Docker provides an (IMHO pretty buggy) isolation layer that lies between
"keeping things that need to be kept separate in separate folders" and
"keeping things that need to be kept separate in separate virtual machines".

I actually don't need the level of isolation below VM and above folder very
often. IMHO this level only really makes sense when containing and deploying
somewhat badly written applications that have weirdly specific, non-standard
system-level dependencies (e.g. Oracle) that you don't want polluting other
applications' dependencies.

I've compiled and installed postgres in separate folders lots of times (super
easy) and I've lost count of the number of times people have said "why don't
you just dockerize that?" as if that was simpler and/or necessary in some way.
That's the effect of "docker hype" talking.

~~~
otabdeveloper1
90% of the time Docker is used to solve the problem of "how do I upload this
bucket of Python crud to a production server?" (Replace 'Python' with any
other language to taste.)

A slightly smarter .tar.gz would have solved the problem just as well.

~~~
eeZah7Ux
> A slightly smarter .tar.gz would have solved the problem just as well.

It's called an "OS package" ;) and can provide stricter sandboxing using a
systemd unit file: unit files provide seccomp, cgroups, and more.
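
For example, a hypothetical unit (the service name and binary path are made
up; the directives are stock systemd options):

    # /etc/systemd/system/myapp.service -- hypothetical example
    [Service]
    ExecStart=/usr/local/bin/myapp
    DynamicUser=yes                    # ephemeral unprivileged user
    ProtectSystem=strict               # read-only OS directories
    PrivateTmp=yes                     # private /tmp via a mount namespace
    NoNewPrivileges=yes
    SystemCallFilter=@system-service   # seccomp syscall allowlist
    MemoryMax=512M                     # cgroup memory limit
    CPUQuota=50%                       # cgroup CPU limit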

~~~
bdavis__
Docker solves two problems. First, you have no control over your devs and
allow them to install any software from anywhere. Second, you want to sell
CPU time from the cloud in an efficient way (for the seller).

~~~
dijksterhuis
Disagree with both statements.

1) is not a containerisation problem. It’s a team problem. I can jam a load
of npm and pip installs into a shell install script. Maybe even delete /usr/
for the hell of it. Because the script isn’t isolated from the OS, I can
cause more damage.

This problem is actually solved by doing code reviews properly and having
team discussions.

2) errr no. Containers != infrastructure. If you want to deploy on bare metal,
you can.

------
chime
> The same will eventually be said of Docker. I’ve yet to hear a single
> benefit attributed to Docker that isn’t also true of other VMs,

I bought a Raspberry Pi and, using a few commands, installed pre-configured
Docker ARM images for 8-9 different media applications that would've taken me
days to set up and manage individually. I didn't have to worry about
dependencies or compilation. It just worked.

~~~
beagle3
Properly packaged Debian/Raspbian apps are still an “apt install” away. Your
use case, which is common, tells me that packaging/distribution may need
some love, not that there’s a fundamental difference.

And the convenience does not come free - a random Docker image is almost as
bad as a random executable.

~~~
vetinari
Packaged and configured are two different things. The fun begins when you
need several applications, each of which has different ideas about configuring
services that would otherwise be shared.

Then, as a matter of convenience, you get packages that either bundle
everything preconfigured to their liking (e.g. GitLab Omnibus), or packages
that configure shared services the way they need them, where running those
services shared is not supported (e.g. FreeIPA, which will configure httpd as
it needs and forget about serving anything else on the same machine).

Docker provides a way to isolate these, so you can still use the same
resources you have to run applications that would not cooperate with each
other on a single machine, without having to run separate OS instances in
separate VMs.

~~~
beagle3
I mostly use Ubuntu these days, have been for 15 years, and I cannot recall a
time when installing a package like FreeIPA disabled a website installed by
other packages. Obviously YMMV, and I might have just been lucky.

~~~
vetinari
With FreeIPA I meant the server, not the client. Installing is not enough;
configuring it into a usable state is needed too (i.e. running
freeipa-server-install or freeipa-replica-install).

It is possible to run other services off such a configured httpd, but you
need to be careful. When FreeIPA wanted mod_nss, you had to use mod_nss, not
mod_ssl (though they have since switched); when FreeIPA wants to use gssproxy,
you are going to use gssproxy too. These changes can happen during upgrades,
and it is up to you to fix everything after such a change.

The project doesn't recommend running anything else on the same server; you
are free to try, though.

The point was that with Docker or another container system, any such problems
are irrelevant, and it allows you to have separate service instances without
having to run separate VMs.

------
threeseed
Cloudera earned $145m last quarter and grew 37% over the previous quarter.
Other Hadoop startups like Databricks are doing well, and Docker went from a
two-digit to a three-digit revenue company from 2017 to 2018. How have
billions been wasted when we have successful companies doing well against the
toughest competitors ever, i.e. Google, Microsoft, and Amazon?

~~~
geodel
Well, that includes revenue from Hortonworks as well. As revenue rises, losses
are rising too, to $85.5M. I feel that in a couple of years they will fold or
be bought by some big cloud vendor.

------
sasavilic
You can't really compare VMs with Docker. Managing containers with Docker +
Kubernetes is far easier than managing VMs. Docker might be replaced with
something else in the future (e.g. rkt), but the basic concept IMHO is here
to stay.

~~~
aduitsis
What about the setup and maintenance of the infrastructure that hosts either
VMs or containers?

Given a modest number of physical machines (say, 40), which is easier to
install, set up, and maintain: a VMware cluster or a Kubernetes cluster?

If someone has any insight, preferably backed by actual experience, it'll be
most appreciated.

~~~
kklimonda
Assuming an unlimited budget for hardware and licenses, building a team that
can deploy and manage VMware on 40 nodes will be much easier. However, raw VMs
are not really comparable to what k8s gives you, so you'll also be solving
reproducible deployments, load balancing traffic to your cluster, etc. Still,
with enough money those can be solved by purchasing more hardware & software,
and you'll have an easier time finding people who can maintain that than
on-premise Kubernetes.

~~~
aduitsis
Many thanks!

------
bryanrasmussen
Isn't this the way it normally works, though? A bunch of investments don't
work out - those are wasted - and for those that do work out, the people who
did the investing get more money back.

Or to put it another way: there must have been some few Hadoop investments
that worked out, and the same will eventually be true of Docker.

~~~
gdulli
I think what they've noticed is a similar arc, where both products initially
got sold as universally required, universally applicable, and generational,
which justified enough investment to spawn a whole industry.

And instead of fulfilling such dramatic hype, they're both just good tools
that are far from universally needed and not objectively superior to all
other options, and there's nothing special about them that will keep them
from getting supplanted by newer tools, which is the norm for the industry
even when the tools are good.

------
metaphor
> _...but standard VMs allow the use of standard operating systems that solved
> all the hard problems decades ago, whereas Docker is struggling to solve
> those problems today._

What are these supposed "hard problems" the author speaks of?

~~~
saagarjha
Sandboxed, consistent environments to run code in?

~~~
7373737373
All this money should have been spent on developing new (operating) systems,
improving existing ones, or switching to better ones that solve the resource
and communication security problems, instead of creating another
inner-platform effect.

I hope WebAssembly goes in this direction, instead of trying to adapt to
current programming language paradigms.

~~~
AmericanChopper
> All this money should have been spent on developing new (operating)
> systems, improving existing ones, or switching to better ones that solve
> the resource and communication security problems

But this isn’t the problem Docker is trying to solve. It’s just a problem
Docker needed to solve in order for its product to be useful, and it is
completely transparent to Docker users. Docker abstracts away a whole bunch of
work you’d otherwise have to do to implement repeatable builds, it makes those
builds widely distributable, and (depending on how you choose to use
containers) it can also simplify some capacity planning problems.

~~~
robfig
I thought Docker builds are not generally repeatable, since they often
`apt-get update && apt-get install`, which depends on the current state of
external package repositories?

They are definitely not reproducible in the sense of building bit-for-bit
identical containers, unless you use Bazel.

That being said, I've found Dockerfiles to be a much more reliable build
process than most others (I recently struggled to get through LogDevice's
cmake-based build... ugh).
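
One common mitigation is to pin the base image by digest and pin package
versions, which narrows (but doesn't eliminate) the drift. A sketch, with the
digest and version as placeholders rather than real values:

    # Sketch only -- <pinned-digest> and <pinned-version> are placeholders.
    FROM debian@sha256:<pinned-digest>
    RUN apt-get update && \
        apt-get install -y --no-install-recommends \
            postgresql-11=<pinned-version> && \
        rm -rf /var/lib/apt/lists/*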

~~~
AmericanChopper
You’re correct, but how reproducible your builds end up being depends on how
you use it, and how reproducible you need them to be depends on your use case.
Maybe a particular use case wouldn’t fit in very well, maybe it would be
better served by something like Packer, maybe your dependency management
requirements mean you should use something like Artifactory. No technology is
going to be suitable for everybody’s needs, but Docker provides enough value
to enough people that it’s found a place in the market. If Docker dies, I’d
imagine it would be because it was replaced by something better, not because
people suddenly realized that they weren’t getting any value out of it.

------
some_random
Seriously, who reads this kind of garbage? It's painfully clear to anyone with
any Docker experience that the author hasn't even skimmed the Wikipedia page.
Reminds me of the idiots blasting out blog posts about how BITCOIN IS THE
FUTURE one month, then BITCOIN IS A SCAM the next.

------
mananvaghasiya
> I’ve yet to hear a single benefit attributed to Docker that isn’t also true
> of other VMs

Oh boy

------
Jonnax
Who is this commenter?

Containerised applications are commonly used. At this point it's a proven
technology with clear use cases.

------
alfiedotwtf
The difference being that people actually use Docker

------
mrosett
The article just quotes the news about MapR and then asserts the same will
happen with Docker. That may be true, but there’s no evidence here.

------
skc
I suppose I haven't worked in environments scary enough that Docker was a
necessity.

As a result, I find it difficult to understand the hype.

------
kumarvvr
I have come across quite a few articles that mention that for 99% of 'big'
data problems, Hadoop and the like are overkill. Simple tools with a beefy
machine are just as sufficient for the task.

Is that the reality today?

Personally, I too feel that distributed computing is overkill for most 'big'
data problems.

~~~
huffmsa
Yes. Very, very, very few people have data as big as "the internet" and the
need for speed (which was the reason Google developed a lot of their
distributed tooling) that you get with these frameworks.

And it really only makes sense to have permanent infrastructure for
distributed computing if you're constantly using it. Like if you're constantly
rebuilding an index of the internet. Which most people aren't.

For occasional reindexing jobs, I've personally had success with Kubernetes.
Our customer-facing services were deployed with it, so it was trivial to bump
up the number of nodes, schedule a bunch of worker containers, and then roll
everything down when the job was complete. No need to learn the ins and outs
of Hadoop.
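
Concretely, that pattern is roughly a Kubernetes batch Job -- a hypothetical
sketch (the name, image, and parallelism are invented):

    # Run 8 workers to completion, then everything scales back down.
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: reindex
    spec:
      parallelism: 8
      completions: 8
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: worker
            image: registry.example.com/reindex-worker:1.0  # made-up image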

~~~
threeseed
It is very, very, very common for companies to have enough data to need a
distributed approach to ETL, and it's especially common if they are doing any
machine learning, which is most of them.

In telcos you have network telemetry data. In supermarkets and retail you have
purchase data and often credit card gateway data. In banking and finance you
obviously have transaction data.

And with Kubernetes you still need a compute framework -- like, I don't know,
the industry standard, the Hadoop/Spark framework.

~~~
huffmsa
A lot of companies think they need machine learning because the management
consultants tell them they do (and hey, we can set it up for you!). And those
that are "doing machine learning" often aren't doing it very well and have
trouble applying it to the business.

Note I'm not saying _all_, just 99%, like the parent comment referred to.
That leftover 1% are the companies you can name off the top of your head:
ExxonMobil, Target, Chase, Visa, etc.

------
jgalt212
My beef with Hadoop, and other big data tools, is that for pretty much any
task other than outlier detection, sampling works just as well, and is cheaper
and easier to manage and reason about.

Even Google, the king of big data, will sample your hits on Google Analytics
if your site gets too much traffic.

------
viraptor
> and number three – well it’s not even worth staying in business because
> there’s no money to be made,

I guess tell that to the tens of companies doing PaaS, or the tens more doing
app monitoring/logging. They're successful companies, far from just "breaking
even".

------
jressey
Billions are "wasted" on startups in every industry. That's the point. The
rich guys gamble and some of their gambles pay off. Most of them fail.

------
harimau777
What is a "Hadoop business"? Hadoop is a tool, not an industry or a product.
Can someone explain what the author might be trying to say?

~~~
dijksterhuis
Businesses that based their SaaS data analysis products on Hadoop.

Preconfigured clusters, integration with your existing AWS deployments. All
that sort of jazz.

------
bayareanative
Yup. Tech fashionabilism is akin to Big4/MBB FUD flavor of the quarter...
populism/marketing doesn't a necessity make. Docker, Kubernetes, gulp, Hadoop,
mosh, Nix, SmartOS, serverless, cloud, virtualization, [insert tech fashion
hype > utility here].

Speaking of Hadoop: my vehicle is parked outside one of the HQs of another
top-10 Hadoop startup. It's one of the most expensive, nearly empty buildings
in one of the highest-rent areas of the Valley. (Money flushing sound here.)

Fun fact: one of the enterprise Hadoop CTOs is a brony.

~~~
threeseed
I love how the "cloud" is a tech fad in your eyes, along with technologies
like Kubernetes, which is core to Google, and virtualisation, which is core to
every VPS/cloud provider in existence.

It'd be a pretty awful world if everyone took your advice.

~~~
AmericanChopper
Isn’t it obvious? Everybody who doesn’t operate their own datacenter is a
mindless technology hipster.

Whenever a big new piece of tech comes out, most of the detractors seem to
either a) apply it to a use case it isn’t fit for, or b) try to use it
without bothering to learn how, and then proceed to say ‘see, it’s not all
it’s cracked up to be’.

------
xenospn
So I've never used Docker, but it sounds an awful lot like FreeBSD Jails. How
is it unique?

~~~
kkapelon
Much more user-friendly. Works on all major platforms. Big public repository
(Docker Hub). Extensive tooling for everything. Graphical and CLI tools for
all kinds of users.

------
villgax
What even is a Docker startup? Docker the tool is a relief in every way, and
so is Kubernetes.

------
dio123
How about “service mesh”?

