Billions wasted on Hadoop startups, the same will eventually be true of Docker (smashcompany.com)
227 points by quicksilver03 27 days ago | 209 comments



I am very much a fan of hot-takes, but this one is trash --

> The money was wasted on hype. The same will eventually be said of Docker. I’ve yet to hear a single benefit attributed to Docker that isn’t also true of other VMs, but standard VMs allow the use of standard operating systems that solved all the hard problems decades ago, whereas Docker is struggling to solve those problems today.

Linux containerization (using the word "docker" for everything isn't right either) is an isolation + sandboxing mechanism, NOT a virtual machine. Even if you talk about things like LXC (orchestrated by LXD), that's basically just the addition of the user namespacing feature. A docker container is not a VM, it is a regular process, isolated with the use of cgroups and namespaces, possibly protected (like any other process) with selinux/apparmor/etc.
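You can see this for yourself; a quick sketch (the image and container name here are arbitrary):

    # an ordinary image, run in the background (nginx is just an example)
    docker run -d --name demo nginx

    # the "container" is a plain host process with a PID...
    PID=$(docker inspect --format '{{.State.Pid}}' demo)
    ps -fp "$PID"

    # ...placed into its own namespaces and cgroups
    sudo ls -l /proc/$PID/ns
    cat /proc/$PID/cgroup

    docker rm -f demo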

Containerization is almost objectively a better way of running applications -- there's really only one question: do you want your process to be isolated, or not? All the other stuff (using Dockerfiles, pulling images, the ease of running languages that require their own interpreters since you package the filesystem) is on top of this basic value proposition.

An easy way to tell that someone doesn't know what they're talking about when speaking about containerization is if they call it a VM (and don't qualify/note that they're being fast and loose with terminology).

All this said -- I do think Docker will die, and it should die because Docker is no longer the only game in town for reasonably managing (see: podman, crictl) and running containers (see: containerd/cri-o, libcontainer which turned into runc).

[EDIT] - I want to point out that I do not mean that Docker the company or Docker the project will "die" -- they have done amazing things for the community and development as a whole that will literally go down in history as a paradigm shift. What I should have written was that "docker <x>" where x is "image", "container", "registry", etc. should be replaced by "container <x>".


I'm not going to support the general thesis of this article, but I want to address something you said.

You're right that containers are not VMs, but that's only really relevant as pedantry of technical details.

I think that what the author was trying to say (without really understanding it) was a comparison of containers to VMs as units of software deployment.

I don't think anyone is credibly using containers as a security measure on Linux, because if they think they are, they are in for several large surprises.

Rather, we're seeing the unbundling of software - it used to be that you deployed software to a physical machine with a full OS, then you could deploy it to a virtual machine with a full OS, then you could deploy the process, its dependencies and a minimal OS into a container.

I agree that Docker doesn't have a huge and profitable future ahead of it, because it's providing commodity infrastructure. Rather I think it's interesting to think about what the next level of software deployment decomposition will be, and I'd wager that it's FaaS (ie serverless).


> You're right that containers are not VMs, but that's only really relevant as pedantry of technical details.

That isn't pedantry, it is an extremely critical point and the one most people miss when figuring out Docker. Both from a security context (docker doesn't provide vm level promises about isolation) and from a resource management side (docker is really close to no overhead).

It is not uncommon to deploy containers on VMs in the real world...


This is precisely why I said the fact that someone would gloss over this is a red flag. The point is super critical.

VMs are literally so hard to do correctly and in a performant fashion that parts of CPU instruction sets[0], and kernel subsystems (KVM[1]) were created to make them easier to run. Containers, in contrast, are literally a few flags and a bunch of in-kernel antics.

A few people, notably Liz Rice and Jessie Frazelle, have given talks on how to make containers from scratch that are very illuminating for those that are interested:

https://www.youtube.com/watch?v=HPuvDm8IC-4

https://www.youtube.com/watch?v=cYsVvV1aVss

[0]: https://en.wikipedia.org/wiki/X86_virtualization

[1]: https://en.wikipedia.org/wiki/Kernel-based_Virtual_Machine


Containers are "easy" because they are backed by tons of kennel code (cgroups, namespaces and basically a small part of many other subsystems). You can actually create containers from the shell!

VMs are "hard" because you start with nothing save some very low-level help from the hardware.


Also, I want to be clear that I'm using "simple" and "easy" in the Rich Hickey sense of the words, as in "simple" has more to do with what a thing is made of, and "easy" has more to do with ease of use, taking available tooling/familiarity and context into account.

What I should have made clearer was that I think containers are both easier and simpler than VMs. There are both fewer moving parts (I haven't looked, but I assume less code), and containers are easier to get started with than VMs (set some flags on some syscalls versus make sure you buy the right CPU).


Fair enough, but don't forget there are some extremely simple hypervisor implementations out there too.


> I think that what the author was trying to say (without really understanding it) was a comparison of containers to VMs as units of software deployment.

I agree, I think the next evolution of software deployment is definitely heading to sub-program sizes. I do want to point out that we've seen this before, it was called CGI. It's not exactly the same, and things will be better this time (more isolation, better tooling), but if we're doing functions-as-a-service now, we were doing scripts-as-a-service much earlier.

I think the great unbundling of the future is definitely coming and in fact it's already here -- it's just unevenly distributed.

> I agree that Docker doesn't have a huge and profitable future ahead of it, because it's providing commodity infrastructure. Rather I think it's interesting to think about what the next level of software deployment decomposition will be, and I'd wager that it's FaaS (ie serverless).

This was not what I meant to get across -- Docker may have a huge and profitable future ahead of it, but that is only tangentially related to the near-assured continuance of containerization. Docker the company and Docker the project have their own goals; they do more than simply offer a way to run containers, and they have for a very long time -- my point was that the literal use of the word "docker" should die, because we should be just referring to containerization in the general sense (no matter which lib you're using). It's like the "tissue" vs "kleenex" debate, in a way.


That's a fair point, small-d-docker should not be a thing.


It’s very easy to underestimate how helpful it can be when you first start working with it. It’s a black box that uses root to do everything and is a pain to debug. Because of this it becomes easy to hand wave it away.

Once you have a properly set up project going and your entire build process is mostly repeatable, the benefits start becoming more obvious. Yes, you can do all the same things to a certain extent in a VM, but it's really hard to keep that streamlined and up to date. Having a script that sets up your stack in a VM on both Windows and Mac and then runs on Linux is a pretty big maintenance nightmare. A Dockerfile works with a few commands and can be added to your repo.
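As a rough illustration of the "few commands" (the Node app here is purely a placeholder):

    cat > Dockerfile <<'EOF'
    FROM node:10
    WORKDIR /app
    COPY package*.json ./
    RUN npm install
    COPY . .
    CMD ["node", "server.js"]
    EOF

    docker build -t myapp .
    docker run --rm -p 3000:3000 myapp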

It’s not without trade offs but I think if they can solve the issue of debugging in a better way then we’ll really see things solidify on this concept.


> It’s a black box that uses root to do everything and is a pain to debug.

This is less true these days (and on Macs that had to use the docker machine "hack" it was barely ever true, per se) -- rootless containers are on the way thanks to user namespaces. For example LXC can run fully rootless containers that act more like VMs themselves (as in they will have systemd as pid 1 inside) -- kernel support, user namespaces, and Filesystem in Userspace (FUSE) make this possible.

IMO, one of the biggest benefits to running containers is actually E2E tests -- I don't see it done as often as I should, but it has become drastically easier to run an entire postgres instance for a single local E2E test run -- I do this on almost every project I start now. I set up E2E tests that spin up the actual world I expect in production (so all the backing services, at the versions they will run at) and interact with my application -- this is a huge step forward compared to a huge wiki page on "how to set up the local test VM" -- you can spin up and shut down these services so fast that you can use them ephemerally (I do) without much worry.
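For what it's worth, the "ephemeral postgres" bit really is about this much work (version and password are arbitrary):

    # spin up a throwaway postgres for the duration of the test run
    docker run -d --rm --name e2e-pg -p 5432:5432 \
      -e POSTGRES_PASSWORD=test postgres:11

    # ... run the E2E suite against localhost:5432 ...

    # tear it down; --rm means the container and its data vanish with it
    docker stop e2e-pg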


Lxc/lxd is a fantastic piece of software that I wish more people would try. Especially if you're already on Ubuntu, Debian or Arch.

You absolutely don't need to be a cloud person to use containers. For example I use lxc on my laptop to keep different projects separate and test new software.


I agree -- it also offered more isolation (user namespaces) from very early on and is way more featureful (live migration has been a thing in LXC for a while IIRC).


Have any links to share that you really liked as examples? I'm using docker compose to do what I think you are saying you do with just pure lxc. I would love to know more about how you do it.


Not really. I personally use each container like a full VM with its own software. I don't try to combine a lot of small containers into a system the way you would do with docker.

Here is a general tutorial if anyone wants to test lxc:

https://linuxcontainers.org/lxd/getting-started-cli/
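The basic flow, roughly (the container name is arbitrary):

    lxc launch ubuntu:18.04 proj-a     # full system container, systemd as PID 1
    lxc exec proj-a -- bash            # hop inside and work as if it were a VM
    lxc list                           # see what's running
    lxc stop proj-a && lxc delete proj-a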


> One of the biggest benefits to running containers is actually E2E tests...

100% agree. Most developers don't even bother testing code that requires DBs and other external services because it's a pain to keep test environments up and running so a lot of the most important and buggy code slips through the cracks.

Then you have the developers who do test but just make heavy use of mocking libs and other tricks which aren't helpful in the long run since the mocks don't actually do anything.

But setting up a docker compose with the whole stack - completely reset each time the tests are run - makes integration tests so much more valuable. And usually it's not a huge deal to set up the plumbing since the docker files are often already created for production.
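Roughly, the plumbing ends up being little more than this (assuming the compose file from dev/production is reused; the test script name is a placeholder):

    docker-compose up -d        # bring up the whole stack from docker-compose.yml
    ./run-integration-tests.sh  # hypothetical test runner pointed at the stack
    docker-compose down -v      # tear everything down; -v drops the volumes so state is reset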


Docker containers must run on a vm on Windows and Mac.

So it's an innovation in packaging.


"All this said -- I do think Docker will die"

Harsh. Docker is a nice way to specify how you want your container to work. Sure release Bocker (A better Docker). But Bocker will just basically be Docker: a simple set of instructions to get a machine running.

That is all we want. A few instructions to get a system up. Devs are sick of setting up machines. Demand is there.


I agree the way I said it was harsh -- I don't mean Docker the company should die or Docker the project should die or anything of that nature, I mean literally people referring to everything in containerization as "docker" + <whatever the thing is>.

Also, Docker itself is very much trying to be more than what it is right now, Docker swarm, compose, and basically everything Docker as a company does to improve their product offering moves them farther away from the way people are using "docker" in conversation right now.

Basically, when people say "docker image" they should be saying "container image" (for lack of a better term, at least).

> That is all we want. A few instructions to get a system up. Devs are sick of setting up machines. Demand is there.

This is pretty vague -- "setting up machines" can mean a lot of things. I want to make it clear that it's not the container runtime's job to set up a single machine, it's to enable a process to pretend that the machine it has access to is its own. It's a subtle difference but it's worth noting IMO.

If you want to fix setting up your machines, you should be looking at tools like Packer, Container Linux distributions (formerly CoreOS, now Atomic I believe), Nix, Guix, and Linuxkit. Docker does not set up machines, it sets up processes, but happens to bring along a filesystem (amongst other things).

A linux container is not a machine (virtualized or otherwise), it is an isolated process. I agree that devs are sick of worrying about machine configuration when setting up the required dependencies to run their processes, though. Containerization is here to stay -- it was around before docker and will be here after.


I agree that the industry will waste billions on startups. There's not really a market there for 'Docker' containers, and there never was.

docker-containers and their related counterparts are abstractions. Useful abstractions don't necessarily equate to a new line of business.

The best attributes of containers (IMO) are packaging and distribution. What businesses and operators need is a repeatable, easy way to deploy applications across their infrastructure. Containers are one piece of that story.

The bigger piece, and IMO, where the business viability is, is the orchestration layer. Containers aren't very useful by themselves, you need a way to get your application online. That's where Kubernetes comes in.

You need to understand large organizations and their challenges to see what layer of the containerization stack holds the most value.

Long-term, I see 'linux containers' as we know them going away. The industry is going to move to something like [1]: lightweight, hardware-assisted VM/container hybrids. But, no matter what happens at the containerization layer, the orchestration layer is the piece that adds business value to end-users (eg, not AWS or other hosting providers).

1: https://katacontainers.io/


So, last time I had a web dev job was back in 2016, so the whole container thing kind of passed me by. So I don't know the details- but, on the other hand, I also don't have any baggage about it being "a VM".

But, starting from that kind of "clean slate" state I have to say that if it takes exasperated internet posts, like your helpful comment, to explain why containers are not like VMs and how they are not like VMs... well then maybe they are not that much not like VMs to make them such a big new thing.

That goes for many things. Like, I don't get the difference between volleyball and beach volleyball. One is played on the beach. So it's volleyball? Played on the beach?


Beach volleyball is played typically with 2's or 4's. The rules are slightly different, court size is smaller. There are no positions in 2's, there are no position faults.

Feel free to make analogies between two other similar sports, such as Competition Karate and Taekwon-Do.


I'm sorry but this still sounds like the same game to me. Why is it not called Beach 2-by-2-ball, or something similar, if the number of players makes it so very different? If I get a couple of my mates together and we bounce a ball over a net counting points when the ball hits the ground -- how is what we're playing not (an informal game of) volleyball?


How is basketball different from indoor volleyball? They're both played indoors on a wooden floor court, both have two teams in opposition, playing simultaneously. Both have in bounds and out of bounds. Both have a ball. Both involve passing a ball. Both have scoring dictated by placement of the ball.

I mean, we can talk about things that make things similar all day, it doesn't make them the same thing. Is a hamburger a sandwich? Sometimes, the whole is greater than the parts, not everything is perfectly decomposable in life.


>> How is basketball different from indoor volleyball?

Because one is about putting the ball through a basket and the other is about bouncing it over a net?

>> Is a hamburger a sandwich?

Yes, actually.

But, a very special kind of sandwich.


Well the thing is, they aren't a big new thing! But the hype train needed it to be (and to be fair, the ecosystem has benefited greatly from the attention).

You rarely see hypetrains for old, established things :)


> they call it a VM (and don't qualify/note that they're being fast and loose with terminology).

It's only confusing to people who are familiar just with the popularized forms of computer science terms.

Isolation and sandboxing is virtualization. In a container, the applications seem to have an operating system and machine to themselves.

A single Unix process and its address space is also a kind of virtual machine, creating the illusion that the process has a machine all to itself. Thanks to virtual memory, other processes are not even visible; they are in a different name space of pointers. That concept breaks for multi-process applications: processes are aware of each other through manipulations of shared resources like files. Or effects like not being able to bind a networking port because some other process is tying it up. The next level of virtualization is to have namespaces for the other resources in the system beyond the address space. As far as just the filesystem space goes, we can virtualize with tools like chroot. A group of applications can have their own global /etc configuration, their own version of the C library in /lib and so on. That's the beginning of "containerization".
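For illustration, that chroot layer alone looks something like this (assuming a minimal root filesystem has already been prepared at /srv/rootfs, e.g. with debootstrap):

    # filesystem-only isolation: the shell below sees /srv/rootfs as /
    sudo chroot /srv/rootfs /bin/sh
    # its /etc, /lib and so on are now the ones inside /srv/rootfs,
    # but it still shares PIDs, the network, and everything else with the host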


> It's only confusing to people who are familiar just with the popularized forms of computer science terms.

Yeah given how often people mistakenly call containers VMs or assume they have VM-like properties (like the ability to run a different kernel), this is tripping a lot of developers up, not just lay people.

> Isolation and sandboxing is virtualization. In a container, the applications seem to have an operating system and machine to themselves.

I don't think this is quite true; generally you can isolate and sandbox something without virtualizing anything. In this case, to isolate and sandbox without a loss of functionality, we are virtualizing and swapping implementations out from underneath an application/process. That is not always the case though -- if you unplug a machine's ethernet cable it is isolated, but you did not virtualize anything; if you deny access to a folder on disk for a process you are isolating without virtualizing, unless you mean that any kernel interaction (like blocking an `fopen()`) is "virtualization".

> A single Unix process and its address space is also a kind of virtual machine, creating the illusion that the process has a machine all to itself. Thanks to virtual memory, other processes are not even visible; they are in a different name space of pointers. That concept breaks for multi-process applications: processes are aware of each other through manipulations of shared resources like files. Or effects like not being able to bind a networking port because some other process is tying it up. The next level of virtualization is to have namespaces for resource-related namespaces in the system beyond the address space. As far as just the filesystem space goes, we can virtualize with tools like chroot. A group of applications can have their own global /etc configuration, their own version of the C library in /lib and so on. That's the beginning of "containerization".

This is a good summary of how containerization works, and the layers of isolation provided by the kernel -- the approach here is isolation by providing a fake, but that does not mean that isolation + sandboxing = virtualization, that just means that these forms of virtualization can be used to provide isolation & sandboxing.


Errr... how do these people think macOS and Windows can run Linux containers? It is a VM on those platforms.


> do you want your process to be isolated, or not.

No, not always. Why?

At work I have a few coworkers pushing hard to dockerize (isolate?) everything.

This makes debugging when things go wrong a lot harder.

I see isolation as one of several qualities a process could have, that sometimes is valuable enough to be worth the sacrifice.

Isolation is not some absolute quality that is without significant tradeoffs.


> This makes debugging when things go wrong a lot harder.

I've found the opposite to be true. If you have proper observability into your isolated services you can more easily pinpoint the origin of a bug.


I avoided saying that processes should always be isolated because there are sometimes very good reasons to not isolate a process with the containerization approach we're talking about, performance being one that came to mind quickly.

Containerization of processes definitely increases complexity but if you can take the time to understand VMs then you can (and should, IMO) take the time to understand how containers work as well, they are lighter and simpler (for example, you don't need to build a kernel or make an initrd). I would argue that people who think VMs are simpler are actually being fooled by huge advancements in tooling over the years and the fact that it's become "easy", not that it was ever simple.

I also want to point out that containers should actually make tracking down some bugs easier, but it does so in a counter-intuitive way -- it removes whole classes of bugs from ever occurring. You'll never have two programs clobber some shared folder or resource, you'll never have programs fight over dependencies, or struggle for locally-bound ports if you're running them in containers.

Containerization definitely represents an increase in complexity, but it is well worth the effort, most of the time, granted you understand the tooling.


> if you can take the time to understand VMs then you can (and should, IMO) take the time to understand how containers work as well

I don't see it as VMs vs containers.

We have a good devops process to deploy onto our instances, so we rarely have resource clashes you mention (ports/directories) because none of that is ever configured manually. All our infrastructure is derived from 'scripts', so it hasn't been a problem at all.

Aside from python, I see no advantage in containerizing any of our processes at all.

As for debugging, I always forget how infuriating it is, till in the heat of the moment I have to open up a shell into someone's badly made docker image and try to use common tools to help diagnose a problem (ps, nslookup, dig, etc.), all missing from the wonderful little container.

It's like being on a big navy ship, stranded in the ocean because the engines broke down, but everyone left all the tools back at the base. Yay!


> I don't see it as VMs vs containers.

It's not? I didn't mean to pit them against each other in competition, I'm saying that if VMs are worth learning about and taking the time to understand, so are containers. It doesn't have to be zero sum.

> We have a good devops process to deploy onto our instances, so we rarely have resource clashes you mention (ports/directories) because none of that is ever configured manually. All our infrastructure is derived from 'scripts', so it hasn't been a problem at all.

It seems like it was a class of problems that you have fixed with "good devops process". I'd argue that it probably was a problem at one point, and you improved your devops process to make sure it wasn't.

> Aside from python, I see no advantage in containerizing any of our processes at all.

Well I don't know your infrastructure so I can't comment on that. I doubt that python is the only thing you run that could benefit from containerization (which again, means limiting access to system resources through namespaces and cgroups), but if you say so then I have no choice but to believe that it's the case.

> As for debugging, I always forget how infuriating it is, till in the heat of the moment I have to open up a shell into someone's badly made docker image and try to use common tools to help diagnose a problem (ps, nslookup, dig, all) all missing from the wonderful little container.

Sounds like you could use some more of that "good devops process" you had when you set up the deploy machinery.

Also, the fact that all of that stuff is missing from the container is actually beneficial from a security point of view -- the inconvenience you're experiencing is the same inconvenience an intruder would experience before breaking out of the container (assuming they had the skill to do that). This means that you have another chance to catch them downloading and/or running `ps`/`nslookup`/`bash` or whatever tooling and flag the suspicious behavior. Whether you're in a VM or not, containers are another line of defense, and that's almost certainly a good thing.


> It seems like it was a class of problems that you have fixed with "good devops process". I'd argue that it probably was a problem at once point, and you improved your devops process to make sure it wasn't.

It certainly was but we fixed it and it's not a problem anymore.

> Sounds like you could use some more of that "good devops process" you had when you set up the deploy machinery.

Yeah, there are people within my group that want to 'modernize' things and put them into containers willy nilly for no real reason.

We have already solved all the difficult problems that containers are supposed to 'save' us from. Many of the proposed containers would just be a single statically linked binary with a config file.

Why?

FYI, our stuff is hosted internally, so security considerations are not such a big deal.

To hear these container advocates, you'd think that till they came around no one ever managed to use linux.

I'm fully expecting linux userland tools to go away, to be replaced by custom 'distributions' with only a kernel and a docker API soon.


> I'm fully expecting linux userland tools to go away, to be replaced by custom 'distributions' with only a kernel and a docker API soon.

They're already here!

- CoreOS Container Linux (now owned by Redhat)[0]

- RancherOS[1]

- Kubic[2] (more focused on running Kubernetes, but same idea)

There are also tools like Linuxkit[3] which focus on helping you actually build images that run the containers you want and nothing else @ startup, which is pretty cool I think.

[0]: https://coreos.com/os/docs/latest/

[1]: https://rancher.com/rancher-os/

[2]: https://kubic.opensuse.org/

[3]: https://github.com/linuxkit/linuxkit


Docker famously doesn't isolate very well, as known in infosec circles for years now. If you're unaware, search 'containers don't contain'.

MicroVMs start faster and provide better isolation.

Meanwhile, none of this is relevant unless you're building your own cloud platform, which is a huge waste of time for most companies.

MicroVMs, containers, VMs, zones and bare metal are places to execute code. Serverless makes all those distinctions irrelevant.

Sorry if you spent 2015 getting really into Docker. You bet on the wrong horse. It's OK, this happens in tech.

Edit: if it's unclear, I don't mean Docker itself is the wrong horse, I mean containerisation tools per se are the wrong horse - and a bad place to invest your time unless you work for a cloud provider


And given recent revelations, VMs don't contain either!

Docker is far from betting on the wrong horse. I can build a docker container and deploy/orchestrate it however I wish - via a docker runtime, kubernetes (which can vary in its underlying implementation), microVMs, VMs, or bare metal.

But then the containers don't need to be docker either. There are many options available there too.

Either way, there are many options available. And the horse(s) are still in the running.

As for "serverless" vs containers/other, if you can provision accordingly, containers/other can be a better option at scale. Horses for courses, I guess.


Sure, having a lot of VM knowledge is also irrelevant unless you're building your own cloud platform too. *

Again, most people don't need to build their own cloud platform.

Being able to reuse Docker files is nice -- you've reinvented CFEngine for the eighth time. That's great.

The point is Docker is yet another standard way to build boxes and contain them. Serverless / FaaS platforms obviate that need.

* obviously an isolated kernel is better than a shared one, but the point of this post (and this comment) is that Docker vs VM vs MicroVMs vs zones etc is irrelevant for companies who are not cloud providers


Your point is actually the point I'm trying to make -- you shouldn't be using docker to try and contain possibly malicious code, it's not for that.

It does isolate a process's view of the filesystem, it does isolate a process's view of the PID namespace, and that is valuable. Let's say it's a spectrum:

raw processes -------- processes with a certain user ---------- namespace+cgroup isolated processes ---------------- VMs

Docker is certainly an improvement for processes that you don't need completely contained but do want somewhat isolated. For isolation in every sense, you want a VM (a lightweight one if you can get it, i.e. some stripped down qemu).

> MicroVMs start faster and provide better isolation.

Agreed on isolation -- VMs are hands down better at isolating, but I'm a bit skeptical about starting faster.

> Meanwhile, none of this is relevant unless you're building your own cloud platform, which is a huge waste of time for most companies.

Again, this is exactly why it's relevant, 99% of people aren't building their own cloud platform, so they don't need the full isolation of VMs -- most of the time they're just trying to prevent program A that their devs wrote from clobbering program B that their devs also wrote when they both run on the same machine, and making it easier to deploy the dependencies that come with each.

> MicroVMs, containers, VMs, zones and bare metal are places to execute code. Serverless makes all those distinctions irrelevant.

OK I don't even really know what this means, you know serverless runs on MicroVMs right? and most of the time it's actually containers in MicroVMs? MicroVMs are just stripped down versions of regular VMs, and no one is in a zone unless they're running Solaris.

The distinction between these things is still very important, unless you mean that the future is everyone just deploying functions for their applications? But even if you mean that, cold start is basically the first stumbling block you see and it literally exists because of the distinction between how these technologies work (and how fast they can be started on demand, with how much isolation).

> Sorry if you spent 2015 getting really into Docker. You bet on the wrong horse. It's OK, this happens in tech.

Is there anyone that spent 2015 really getting into Docker and isn't better for it now? The technologies that have sprung out of this part of computing are very valuable to know and are getting more valuable, not less. You don't have to install a VM to run an isolated postgres instance on your dev machine because containerization exists -- if you're still doing this you should probably look into updating your tooling.

Also, don't forget that containerization is how some of the richest and supposedly best (due to their ability to spend money on engineers) companies in the world have been handling deployment for nearly a decade -- 2015 is late for realizing containerization is a good thing, not early.


> you shouldn't be using docker to try and contain possibly malicious code

Indeed. And since we're not building our own cloud environment, because that's a waste of resources for most companies, we will share an environment with possibly malicious code so therefore need isolation.

> unless you mean that the future is everyone just deploying functions for their applications?

Yes, that is exactly what I mean.

> You know serverless runs on MicroVMs right?

Yes, that's why I mentioned them

> and most of the time it's actually containers in MicroVMs?

I doubt this - AWS's performance documentation focuses on MicroVMs as an alternative to containers, not an addition. Which makes sense as containers do less than MicroVMs.

But hey, even if the AWS MicroVM documentation is wrong, it doesn't matter. I am not building a cloud platform. I do not care.

You're right about spin up time. Open a socket and let your apps terminate with it open for a greater chance of reuse. As a FaaS user that's your entire concern with your execution environment.

> 99% of people aren't building their own cloud platform, so they don't need the full isolation of VMs

All cloud environments must provide isolation between customers, hence VMs / MicroVMs. Customers adding docker on top of that add a huge administrative overhead that duplicates the features of their cloud provider for little benefit.

> Is there anyone that spent 2015 really getting into Docker and isn't better for it now?

Every single person whose product is not Internet infrastructure, that wrote or configured their own unnecessary custom LXC or Docker and VM environment because 'Docker changes everything'.

If a company's product is machine learning for detecting cancer and their ops person has a custom Docker/kubernetes environment they're misusing their employer for their own technical interest

>2015 is late for realizing containerization is a good thing, not early.

It is indeed. It's just that then Docker hype was at its maximum.


> Yes, that is exactly what I mean.

OK, I'd like to note that it's also the past -- CGI was (and in some dark corners still is) a thing.

> Every single person whose product is not Internet infrastructure, that wrote or configured their own unnecessary custom LXC or Docker and VM environment because 'Docker changes everything'.

Yeah, but those people now have local environments that are way easier to run?

> If a company's product is machine learning for detecting cancer and their ops person has a custom Docker/kubernetes environment they're misusing their employer for their own technical interest

???? If your company's product is machine learning, and developers who must work on that product need to set up their environment on their local machines, docker is easier to get started with than VMs, runs faster, and consumes fewer resources. While it might be arguable that it's easier, there are literal money savings to be had by running a docker container instead of a full VM.

Kubernetes has much more complexity and many more tradeoffs involved so I can see that being a much heavier decision.

What you're saying is that overzealous ops people who are looking to pad their own resumes should not be allowed to run amuck, and I agree with that, but docker is not the poster boy for engineering largess -- and I'd argue it never was. Companies and research groups/smaller distributions have been using containerization very productively for a long time.


> CGI was (and in some dark corners still is) a thing.

Yep. Tech does that - think about centralisation / distribution every few years, maybe it'll cycle back to people caring about their own containment tech in future. But not right now.

> Yeah but those people now have way easier to run local environments?

OK, so they wasted their time on their awful custom Docker/k8s thing that runs on top of EC2 anyway, and they have a slightly better way to spin up dev environments?

The rest of the conversation is about dev environments, bare metal, VMs and containers all have their place and I mostly agree with you (obviously containers are only useful when there's a Linux kernel on that desktop, Windows and Mac are bare metal or virtualising for the most part).

> What you're saying is that overzealous ops people who are looking to pad their own resumes should not be allowed to run amuck, and I agree with that, but docker is not the poster boy for engineering largess -- and I'd argue it never was.

You understand my point perfectly. I believe that docker is precisely the poster boy for engineering largess, but this is based on my own experiences (talking to a lot of young engineers in the startup world who love wasting investor money on ops) and it seems reasonable that you have had different experiences.

I think we have a good understanding of where we each come from and can end it here. Thanks for being civil.


Deployment of Docker containers is nice, but deploying a VM as in Vagrant was also fine. I avoided learning anything about Docker for about 2 years because I thought it was just a fad.

However, I would add that for my own personal use, it's invaluable for development work. All that work that you do _before_ your CI or deployment.

1) When I'm working with a collection of tools that I need but that are a complete mess with lots of state (think: compiler tools, LaTeX, things like that), then docker image build, with its incremental way of running each command piece by piece and saving the state after each RUN, is actually a lifesaver. You follow the steps of some instructions, and of course, as usual, there's one extra step not documented in the manual, so you add that to your Dockerfile. You make a mistake? No big deal, just change the command, the bad state is discarded, and you get to try again. You don't have to run the whole thing all over again. And it's instantaneous. (There's a sketch of this workflow just after this list.)

2) When I have to work with a client's codebase, as a consultant, you'd be surprised how many projects do not have a reproducible build, with Docker or anything else. So I end up building my own Dockerfile. The number of times I've heard "but you just have to run this setup script once" -- well, those scripts never work (why would they? nobody runs them anymore). Especially when it begins with `npm` or `pip` -- almost guaranteed to fail catastrophically, with some g++ compile error, or a segfault, or just a backtrace that means nothing. For example, I recently had to run an `npm` install command and it failed with `npm ERR! write after end`. I re-ran the container again, and again once more, and then it succeeded (https://gist.github.com/chrisdone/ea6e4ba3d8bf2d02f491b4a17f...). npm has a race condition (https://github.com/npm/npm/issues/19989; fixed in the latest version). I wouldn't have been able to confidently share with my client this situation unless I had that reproducibility.

3) It's trivial to share my Dockerfile with anyone else and they can run the same build process. I don't have to share an opaque VM that's hundreds of megs and decide where to put it and how long I want to keep it there, etc.

4) It's a small one; but speed. Spinning up or resuming a VirtualBox machine is just slow. I can run docker containers like scripts, there isn't the same overhead.

5) Popularity is a blessing; the fact that I _can_ share my Dockerfile with someone is a network effect. Like using Git, instead of e.g. darcs or bzr.
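Re: point 1 above -- a rough sketch of that incremental workflow, assuming the goal is a messy toolchain image (the packages are just examples):

    cat > Dockerfile <<'EOF'
    FROM ubuntu:18.04
    # each RUN becomes a cached layer; these only rebuild if their line changes
    RUN apt-get update && apt-get install -y build-essential
    RUN apt-get install -y texlive-latex-base
    # the "undocumented extra step" you discover later goes at the end,
    # so only this layer is rebuilt while you iterate on it
    RUN apt-get install -y texlive-fonts-recommended
    EOF

    docker build -t toolbox .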

By the way, you can also do nice container management with systemd and git. There's nothing inherently technologically _new_ about Docker; it's the workflow and network effects; it lets me treat a system's state like a Git repo.

Nix has similar advantages, but I see Docker as one small layer above Nix.


Yes! What a load of trash this post is, should be on the last page!


I appreciate your nuance and defense in this thread.


I don't understand how these are comparable. Hadoop solved a hard problem that nobody had. Docker solves a simple problem that everyone has. It would make sense if you're talking about kubernetes and using it to build hundreds of microservices because it's currently in fashion. Whether you're using Docker, Packer, Ansible or whatever doesn't matter. They are all a solution to the same problem and saying one is better basically boils down to saying which brand of hammer is better.


>Docker solves a simple problem that everyone has.

Docker provides an (IMHO pretty buggy) isolation layer that lies between "keeping things that need to be kept separate in separate folders" and "keeping things that need to be kept separate in separate virtual machines".

I actually don't have the need for the level of isolation below VM and above folder very often. IMHO this level only really makes sense when containing and deploying somewhat badly written applications that have weirdly specific, non-standard system level dependencies (e.g. oracle) that you don't want polluting other applications' dependencies.

I've compiled and installed postgres in separate folders lots of times (super easy) and I've lost count of the number of times people have said "why don't you just dockerize that?" as if that was simpler and/or necessary in some way. That's the effect of "docker hype" talking.


The difference is standardization. Postgres can be installed in subdirectories nicely, as you say -- if you know how. Same is true for JBoss, CI build agent, whatever. Now if you have dozens of these apps then e.g. onboarding people in a remote team suddenly becomes non-trivial (a nightmare, to be precise). With Docker, they can get a complex system running in an hour.

The primary use case for Docker is, as far as I can see, simplifying deployment across varying environments. Variations can happen for many reasons. Sometimes you have clusters of various sizes in production. Sometimes the environment is a developer laptop. And so on.


>The difference is standardization. Postgres can be installed in subdirectories nicely, as you say -- if you know how.

Or if you have, y'know, a really simple script.

...the kind which also runs inside most semi-complex Docker containers anyway.


I've very rarely seen setup scripts that don't have implicit dependencies on the details of the host environment. Whether it's assumptions about the OS or file system or what related software may be installed or whatever. Often because the original developer can't predict every permutation of the possible interactions because of a combinatorial explosion of possible system setups, sometimes because the script is poorly written, and very often because the language or tooling itself is poorly isolated (eg, pip and npm). Docker is a lightweight way of guaranteeing deterministic setup script execution. Obviously you can still screw it up by pulling from unversioned base images etc, but the numbers of failure modes are limited, typically easy to locate (ie within a single dockerfile rather than across an entire OS), and relatively easy to prevent with some best practices. And best of all, that's true across languages. I can achieve the same effects with pipenv, yarn, or other tooling specific techniques like folder level postgres installs. But then as a developer I have to know the thousands of idiosyncratic pitfalls that occur across the plethora of tools I have to deal with every day. And realistically, I have to deal with a ton of poorly written applications and scripts that I want to execute with some degree of isolation without a ton of overhead. More than I could possibly ever fix.


I've very rarely seen scripts like this make implicit dependencies beyond assuming what kind of package manager is installed. Moreover, everywhere I've worked the package manager was either under our control (in which case no problem) or was mandated from above (we're a red hat shop: use yum - again, not really a problem).

I've spent more of my life and torn out more hair dealing with obscure docker bugs than I have converting scripts from one flavor of linux to another.


Installation is the easy part. Seamless removal and replacement isn't.


It's not whether you need it, it's about becoming a standard. Ubiquity is a strength. It's much easier to learn a few docker commands that will run a container the same way everywhere instead of worrying about distros, folders, config files, volumes, etc. The container registry also makes software distribution much nicer than installing and configuring repos.

It's great that you compile postgres but I just want to run it in a clean and portable way, along with several other programs, and without learning new workflows for each one. Docker containers give people more options to package and run software in a simple standardized process while offloading the tedious system details that don't matter. That's progress.


But this level of work requires an operations guy who knows how to do all this right. Most (not all) developers can get to virtualenv or similar tools, but have issues keeping the rest of the system working and stable, or with firewalls, or with system patching.

As an ops person myself, docker saved me lots of time... developers can run their containers locally, then hand them over to me to stand up. As we move to hosted services, I don't even need to maintain a server. My role is shifting from spending lots of time on ansible and monitoring servers to helping look at code and spending more time investigating weird bugs outside the developers' capacity.

I was an early Hadoop adopter as well... and I agree with people's sentiment here -- it was a tool looking for a problem (outside its specific use case). I used it for its intended purpose, and I also tried to bend it into a web crawler. It actually kinda worked in that regard, but it's not the right usage. It might be able to expand into new use cases though.

Docker solves (again) a real problem in the industry that has existed for decades... and the problem's solution keeps going back and forth. Nowadays we train developers, not systems engineers (I've been trying to hire a systems engineer for almost a year and have nearly no bites... our developer positions get 3 good candidates worth interviewing in 2 weeks or less). This means we have lots of available developers and not enough ops people. Containers help us work in this dynamic -- they simplify the process of getting the dev's application to work in isolation. This means 1 ops guy can support a dozen developers and 30 apps on one server relatively easily compared to before. It shifts the burden of the developer's runtime environment to the developer... we can still step in to help, but now that environment is codified in git.

I've been an ops guy for a decade and unlike my positional colleagues I love Docker, it's let me focus on more important things.


I sort of agree with this, but the advantage I see of Docker is in providing a "standard" (ymmv) interface that's less heavyweight than running an entire simulated virtual machine, but less dependent on the idiosyncrasies of a specific environment.

I can give my coworker a docker image and it mostly "just work" without failing because she happens to be running a slightly different version of Ubuntu with different system libraries present.


I think the author's point is:

> I can give my coworker a docker image and it mostly "just work" without failing because she happens to be running a slightly different version of Ubuntu with different system libraries present.

"I can give my coworker a VM image and it mostly "just work" without failing because she happens to be running a slightly different version of Ubuntu with different system libraries present."

and also:

"I can give my coworker a full system container image and it mostly "just work" without failing because she happens to be running a slightly different version of Ubuntu with different system libraries present."


Yeah, but a VM with virtualized hardware running a whole OS is so overkill for this use case.


That's not all that docker or any containerization gives you. For most people I've seen it's about infrastructure as code and quickly deploying Dev/staging environments. Yes you can do the same with ansible or terraform but what if you only want to test locally? Do you really want to wait for a VM to be carved out? Or just run a command and have things come up?


Also, not needing to support log extraction or deal with unit scripts or SSH or bin packing applications onto VMs or mucking with ansible/packer/etc. Personally I don’t want to spend my time on needlessly tedious, uninteresting problems.


Just filed a bug report. Under "Steps to reproduce" I simply put the docker command to start the container, plus a curl call that reproduces the error. Without docker, how can I be sure that the maintainers will have the same env as I have? This is very useful as far as I can tell.
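Something along these lines (the image name and endpoint are made up for illustration):

    # Steps to reproduce:
    docker run --rm -d -p 8080:8080 example/service:1.4.2
    curl -v http://localhost:8080/api/items   # the request that triggers the bug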


The isolation is done by many other tools, and with less bugs and vulnerabilities.

> applications that have weirdly specific, non-standard system level dependencies

Spot on. 99% of software the world needs can be written against libraries provided by OSes. And then packaged properly.


90% of the time Docker is used to solve the problem of "how do I upload this bucket of Python crud to a production server?" (Replace 'Python' with any other language to taste.)

A slightly smarter .tar.gz would have solved the problem just as well.


No it wouldn't have. What's unzipping and running that code? What's monitoring it and restarting it? How do you mount volumes and env variables? How do you open ports and maintain isolation?

A container is vastly more powerful for running an application than a tar file.


No offense, but you're aware you make it sound as if we used to use punch cards until the arrival of docker?

You can often run daemons as different users and set appropriate file permissions. You can add ENV variables to your start up scripts or configuration files. Volumes are mounted by the system (and you set appropriate access rights again). Monitoring and restarting services is managed by your init system (and probably some external monitoring, because sometimes physical hosts go nuts). Depending on your environment you can just produce debs, rpms, or some custom format for packaging/distribution.

Yes, sometimes you still want docker or even a real VM, and there are good reasons for that - I totally agree. But often it is not necessary. I'm often under the impression that some people forget that the currently hyped and cool tech is not always and under every circumstance the right solution to a given type of problem. But that's not an issue with docker alone...


>You can often run daemons as different users and set appropriate file permissions. You can add ENV variables to your start up scripts or configuration files. Volumes are mounted by the system (and you set appropriate access rights again).

That sounds exactly like creating a Dockerfile. The difference is that your script has to work any number of times on an endless number of system configurations. The Dockerfile has to work once on one system which is a much easier target to hit. The "any number of times on an endless number of system configurations" is a problem taken care of by the Docker team.


Which sounds exactly like having a proper package manager...


You seem to be unaware of the problems docker solves out of the box, and that about 10-15 years ago those problems were solved with in-house developed toolsets.


The difference is that with VMs, you have to configure the things that you get for free with container runtimes. Specifically, Amazon can take care of a ton of the most mundane security and compliance burden that our org would otherwise have to own. Those differences means that developers can cost effectively be trained to do much of their own ops and I can solve more interesting problems.


Basically everything you described came with systemd, the thing that later made docker possible.

Before that it was just a mess. And systemd also isn't that much older than docker.


Exactly. The sad reality is that solving this problem helps many developers who would otherwise not be capable of deploying to servers.


Much like accessibility, even if you are capable of doing everything by hand, it is usually nicer to have most of the gruntwork handled for you.


Well, not necessarily by hand. I never gave up deploying our Java application with Ansible. We could have used Docker but the team decided to use fat jars and Ansible instead. Nowadays with Java 11 you can make those fat jars even slimmer. There was no value proposition for us to change.


I did not work much with deployments, but one thing I liked about Docker over Ansible is that testing the configurations locally is really easy and independent of the host platform.


You are confusing Docker = the deployment artifact with Docker = self-contained runtime perfect for CI/CD

Longer answer here https://thenewstack.io/docker-based-dynamic-tooling-a-freque...


I'm not.

My point was that Docker purports to solve the sandboxing and security problems.

In reality, this is something that 90% of people who use Docker don't give a shit about. For the vast majority Docker is just a nice and easy-to-use packaging format.

The sad part is that

a) Docker failed at security.

b) In trying to solve the security problem Docker ended up with a pretty crufty (from a technical point of view) packaging format.

Maybe we need to start from scratch, listen to the devs this time and build something they actually want.


>My point was that Docker purports to solve the sandboxing and security problems.

Says who? The article I linked to you says nothing about security.

>Docker failed at security.

If somebody thinks security is the strong feature of Docker he/she is misinformed.

>For the vast majority Docker is just a nice and easy-to-use packaging format

For the vast majority of who? Developers? Sys admins? PMs?

The big advantage of docker is the self-contained environment for CI builds.


> A slightly smarter .tar.gz would have solved the problem just as well.

It's called "OS package" ;) and can provide more strict sandboxing using a systemd unit file: unit files provide seccomp, cgroups and more.


docker solves 2 problems. first is you have no control over your devs and allow them to install any software from anywhere. and second is you want to sell cpu time from the cloud in an efficient way (for the seller).


Disagree with both statements.

1) is not a containerisation problem. It's a team problem. I can jam a load of npm and pip installs into a shell install script. Maybe even delete /usr/ for the hell of it. Because the script isn't isolated from the OS I can cause more damage.

This problem is actually solved by doing code reviews properly and team discussions.

2) errr no. Containers != infrastructure. If you want to deploy on bare metal, you can.


Agree with the first one but disagree with the second one. EC2 was selling CPU time a long time before Docker existed.


Docker containers provide seccomp, cgroups and more.

Yes, systemd unit files are containers, just like Docker.


Indeed you're right, but the problem is that your devs' machines and your production systems are running different OS's/distributions.

Nix tries to solve this, but it isn't there just yet.


I know there is cost to this solution but it's a good one:

Use the same OS and similar hardware for development and production.


The cost includes making development impossible without internet access, given that devs are not going to be carrying a cluster of servers around with them.


^ This.

Also means developers can work in whatever environment they want, but the result will be reproducible (almost) anywhere.


> Hadoop solved a hard problem that nobody had.

This baseless assertion is patently wrong on so many levels. Building computing clusters on COTS hardware is a very mundane problem. Running processing jobs on data shards is a very mundane problem. Scaling COTS clusters transparently is a very mundane problem.


Insert "almost" into parent's sentence and it becomes correct.

Many people use/used Hadoop for problems that did not warrant the overhead and complexity that comes with Hadoop. I've seen it countless times with my own eyes that people pre-emptively use tools like Hadoop and Spark because of a chance that they will hit a massive scale in the future.

This happens in both startups and enterprises alike: people like to think they have big problems too often.


This. A prior employer spent millions on a Cloudera install that had 5 worker nodes and was being used to service a bunch of generic jobs.

Worse still, it didn’t even use HDFS and we eventually got sick of the crappy embedded Zookeeper/Kafka setup.


> This happens in both startups and enterprises alike: people like to think they have big problems too often.

A.k.a. resume-driven development. Having Hadoop on your CV looks sexier than awk.


I was at a conference once where someone was presenting and describing something where they used Hadoop to do pretty simple processing on ~30MB of data and the audience appeared to be lapping it up.

I just sat there thinking I could probably run what they did on my phone.

And just to be clear - it wasn't a PoC or a demo.


> Insert "almost" into parent's sentence and it becomes correct.

No, it still remains astonishingly wrong. Even container orchestration platforms are being adapted to provide the same service that Hadoop has been providing for years, and no one in their right mind would claim that running processing jobs on the cloud is a problem that almost no one has.


There are broadly 3 levels of the amount of data that people have:

- Fits on one computer (most of the market)

- Fits on several computers (most of the rest)

- Requires a significant cluster of machines (50+ to store it)

Hadoop only really solves the last one. It has huge overheads in terms of speed and in terms of resources and headcount to run it properly, so it only makes sense at a particular scale. It's like a mainframe – most companies shouldn't buy one.

If you add to this the fact that Hadoop was about batch processing, and its "realtime" capabilities were poor, there really aren't that many potential customers, and many of the potential customers would rather run it in-house, or build their own system.


> - Fits on one computer (most of the market)

There's one category above that, which is "Fits in memory" and that is a huge chunk of the market. I've seen first hand people getting way too cute and complicated planning for scale, and then it works out that they don't even have more than a couple of GB of data.


I remember distinctly a conversation I had in early 2012 about what I jokingly called "big memory" as killing off the (then ubiquitous) "big data" trend. Most people's sensibilities have not adjusted to a world where getting 64GB of memory in a server is a triviality, and you can 8x that without too much effort or cost.

Unless you're storing media, or you are truly "web-scale", your business data will very likely fit in 512GB.


I've heard of a startup spending many days of contractor time to optimise their database so they didn't have to go from the 4GB Heroku database to the 8GB one.

I've heard of a company refusing to purchase an external drive for an employee so they could process a handful of ~50GB datasets on their MacBook Air – instead forcing them to use "the cloud" or constantly download and backup datasets.

I've heard of companies doing extensive work to set up Hadoop to process a few GB.

Roughly I'd suggest that "fits on my laptop" is <1TB, "fits in memory" is < 1TB, "fits on one computer" is < 10TB, "fits on a small cluster" is <100TB, and "might be worth Hadoop" is >100TB. I could be too low on these though.


> I've heard of a startup spending many days of contractor time to optimise their database so they didn't have to go from the 4GB Heroku database to the 8GB one.

From Heroku's site, Heroku's 4GB database plan goes for $50/instance/month while Heroku's 8GB plan goes for $200/instance/month.

Therefore, it isn't a question of whether it makes financial sense (it does) but of how long the startup plans to operate to recover their investment.
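Back-of-the-envelope, with made-up contractor rates: the plan difference is $150/month, so if the optimisation took five contractor-days at, say, $800/day ($4,000 total), the break-even point is roughly 27 months - a long time horizon for an early-stage startup.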

> I've heard of a company refusing to purchase an external drive for an employee so they could process a handful of ~50GB datasets on their MacBook Air – instead forcing them to use "the cloud" or constantly download and backup datasets.

I find it rather strange that someone believes it's a decent idea to conduct a company's data analysis work on whatever an employee manages to fit on an external HD, as it creates a whole lot of hard problems, both legal and technical. I mean, how do you ensure the data's provenance is tracked and other data analysts can access the data? Who in their right mind would put themselves in a situation where a minor lapse or misfortune (losing the HD or getting it stolen) could put the company at risk?

> Roughly I'd suggest that "fits on my laptop" is <1TB, "fits in memory" is < 1TB, "fits on one computer" is < 10TB, "fits on a small cluster" is <100TB, and "might be worth Hadoop" is >100TB. I could be too low on these though.

That's a rather naive and misinformed take on Hadoop. Hadoop might be conflated with big data, but it's actually a distributed system designed to reliably process data shards without having to incur a penalty to move data around. It makes absolutely no sense to base your assertion on data volumes alone. What matters is whether the performance increase justifies setting up a Hadoop cluster with the resources available to a company.


I'll correct you: Hadoop solves a hard problem that lots of big enterprises have. A distributed file system is a big deal, and everything that on top also is.

Now making (big)money with its ecosystem is another question.


...except that HDFS has never been a distributed filesystem, despite the name. It's an object store with just enough hackery so that a particular set of apps with very limited needs could be modified in a short time to use it instead of a filesystem. Arguably that's still a big deal, but nowhere near as impressive or generally useful as a distributed filesystem. To me it's a bit amazing that anyone involved in its design still has a job in this industry.


...ok, then what are the alternatives? And why is it so bad?

Something that can handle hundreds of terabytes on hundreds of machines and provides useful tools on top of the whole thing (Spark, Hive, etc)?


Well, rather topically, there's MapR. Depending on the desired scale and level of fidelity to real-filesystem semantics, one might also count Gluster, Ceph, and Lustre. Don't know if PVFS is still around. There are more proprietary offerings from IBM (GPFS/SpectrumScale), EMC, NetApp, etc. plus a plethora of startups. There are other object stores that don't make a pretense of being filesystems. Alternatives abound.

BTW, "hundreds of terabytes on hundreds of machines" isn't interesting territory any more. Most people's needs are far smaller, so HDFS isn't much help. Those who need more generally need much more, so HDFS isn't much help again. Richer semantics are nice either way. Imagine thousands of machines with dozens of terabytes each and you might start to see the problems with HDFS's design (though you'll still be far short of the domain I work in).


The thing is that this is exactly my company's scale, and hdfs is perfectly fine here. :-)

And I am curious about possible alternatives: open source, about 100-200 machines, with good support for analytics and SQL-ish systems.


Solutions to hard problems of big enterprises should be the juiciest way to make money, no?


I am not a certified specialist in Hadoop-related things but there's definitely an industry of people working with Hadoop/HDFS/Spark/etc. It's almost a standard in Business Intelligence where I work these days.

So there should be some money there.


HDFS is part of Hadoop, the other parts being MapReduce, YARN and HDDS.


Big problems mean big projects, big risks, big staffing plans, big costs, and smaller margins. If you can bear the cost of sales, ten $10mm projects beat a single $100mm in my experience.


"Whether you're using Docker, Packer, Ansible or whatever doesn't matter. They are all a solution to the same problem"

Nope. K8s is a cluster operating system that happens to fit very well with the microservices architectural model. Packer is a way to create classic VM images (this one is similar-ish to Docker, but only if you care about the 10,000-foot, non-technical-at-all view). Ansible is an infrastructure-as-code tool: you deal with mostly classic infra components and compositions of them as code. Putting them all in the same bag is like saying that all programming languages solve the same problem. It's only true if we cut the conversation down to a level where we consider all digital devices the exact same thing.


>> Hadoop solved a hard problem that nobody had.

I'm genuinely curious: if you want to search over big data, which should be a pretty common procedure these days, what alternatives are there to a distributed file system? A DFS seems very complex to me, and it is not clear to me what alternative system designs a DFS will outperform. Is a DFS the only solution to big data?

Relational DBs do break down at a certain scale. What system do you turn to next? Nosql? Will that scale infinitely? Will any system scale infinitely?


What's common about big data?

My guess is that almost all programming jobs are in fields that produce no more data than a GB or two a month.

I'm happy to be proven wrong, but I would guess that there are far more companies making project management software, time tracking apps, invoicing software, etc. than there are facebooks, googles or reddits obsessively logging every user mouse twitch.

And that's data that's much better sitting in a nice, normal, relational database.


>> facebooks, googles or reddits

Yes, it definitely seems with market leadership comes big data. Seems to me big data is highly relevant. More relevant now than ever.


But the point is, vast majority of companies are nowhere close to Facebook/Google/Reddit wrt. data scale, and have no need for tools applicable at Facebook/Google/Reddit scale.


Relational databases work just fine for terabytes of data. Apparently there are Postgres deployments with petabytes of data. There aren't all that many companies with datasets that require anything else. Google and Twitter have the problems Hadoop solves, but it's fair to say that approximately nobody else has those problems.


Whether Postgres (or similar) can handle the data depends not only on the data size but also on what you want to do with the data.

I.e., if you are generating reports or running aggregations over a large amount of data, you definitely need some parallelism, and Postgres isn't designed to handle these loads (certainly not petabytes). Even aggregating 100s of GB probably requires (or at least is more cost effective using) multiple machines.

Now Hadoop may not be a particularly efficient solution unless you need 100's of machines. But there is a limit to what a non-parallel single machine database can do. There are other solutions in-between.

And you really don't have to be twitter or google to handle significant amount of data these days. People are recording much more data in the hope of generating new insights and do need tools to process that data.


> I.e., if you are generating reports or running aggregations over a large amount of data, you definitely need some parallelism, and Postgres isn't designed to handle these loads (certainly not petabytes). Even aggregating 100s of GB probably requires (or at least is more cost effective using) multiple machines.

Aggregating 100s of GB isn't much of a problem for PG these days. Yes, you can be faster - obviously - but it works quite well. And the price for separate systems (duplicated infrastructure, duplicated data, out-of-sync systems, ...) is noticeable as well.

But yea, for many petabytes of data you either have to go to an entirely different system, or use something like Citus.

Disclaimer: I work on PG, and I used to work for Citus. So I'm definitely biased.


I'm genuinely curious. Can PG handle hundreds of users querying 100s of GB, sometimes the same set of tables, at the same time?


Well. You're going to run out of CPU and memory bandwidth pretty quickly. So you'd need replicas to share processing load.

But I honestly don't think hundreds of users each querying 100s of GBs is all that common.


Again, genuinely curious to learn if you've experience with enterprise companies using 'big data' technologies.


Yes.


Not all relational DBs break at scale. Relational DBs break at scale if you rely on very specific optimizations like specific types of indexes.

Even Postgres got this right recently with the introduction of the BRIN index, which is a lot more lightweight.

Look at Netezza, Oracle Exadata, and (disclaimer: I work on this) SQream DB, which can absolutely handle hundreds of terabytes without too much fuss.
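To make the BRIN point above concrete, a minimal Postgres sketch (table and column names are made up):

    -- lightweight BRIN index on an append-only, time-ordered table
    CREATE INDEX events_created_brin ON events USING brin (created_at);
    -- compare with a conventional btree on the same column, which is far larger
    CREATE INDEX events_created_btree ON events USING btree (created_at);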


There's also CitusDB, which performs much better than Hadoop because it doesn't have the bloat of the JVM or the inefficiency of brain-dead HDFS. You can write your queries in a few lines of SQL rather than reams of convoluted Java code.


Which of those indexes/databases should I be using if I want to query over big data using natural language?


This question makes no sense.

First, NLP-based search can be executed on top of any engine (APIs are very handy): relational, KV, graph, filesystem... so that part is totally irrelevant.

Assuming "big data" in this context is still relational data, then any of those systems would suffice, within their own particular tradeoffs and features.


There is no direct relation between the query language and the underlying infrastructure.

If you're talking about taking some questions and getting graphs from them, ThoughtSpot does a good job.


The issue is that 80% of "big data" isn't

https://adamdrake.com/command-line-tools-can-be-235x-faster-...


That's not at all what my issue is though. I have what you might call "big data problems" and I'm so turned-off by Hadoop that I'm actually rolling my own distributed DB. But what principles other than a DFS should one look into, is the question, really. At least to me.


It's a reaction to your statement

> if you want to search over big data, which should be a pretty common procedure these days

It's not a common procedure by any stretch because the vast majority of datasets aren't really "big".


There is no such thing as 'big data' in terms of data; there are properly and improperly designed programs, and Hadoop is in the latter category.


They're comparable in the sense that neither technology had enough market demand to sustain the companies attempting to sell products. Great tech that we all benefit from, but the monetizable bits were not worth the investment.

Then again it's called venture capital for a reason so this isn't exactly unexpected. The question should really be more about the scale and hype that was involved.


Wait, what problem does Hadoop solve?? Hadoop is an extremely poor reimplementation of some Google services. By the time billions went into Hadoop startups, even Google had stopped using MapReduce because it turned out to be inefficient and very limited for many distributed computational problems. MR was extremely well suited for a single thing: aggregating web logs and computing very simple summary statistics. What is the hard problem you mentioned?

Btw. there are many very successful startups in the big data space that understood the limitation of Hadoop and addressed almost every if not all aspects of its shortcomings. A good example would be Snowflake computing.


>> Google stopped using MapReduce

How are they then querying over big data these days?

We don't know, do we? Or did they open-source their search engine?

By using Hadoop people are trying not to reinvent the big data wheel, partly because it's a motherfucker of a problem to have to solve and partly because they want to solve the business problem, not the technical one. I don't see how that is in any way worthy of being frowned upon.



The same concepts are still there, it's just a nicer API now. Think Hadoop -> Spark.

Map -> map, filter, flatmap, etc

Reduce -> reduce, joins, folds, group by, etc

Those other concepts were always expressible as map and reduce, of course, just with a bunch of annoying repetitive work
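A small PySpark sketch of that point (word count; the input/output paths are placeholders), where every step still maps back to either a map or a reduce:

    from pyspark import SparkContext

    sc = SparkContext(appName="wordcount-sketch")
    counts = (sc.textFile("hdfs:///logs/*.txt")        # placeholder input path
                .flatMap(lambda line: line.split())     # still a map, just one-to-many
                .map(lambda word: (word, 1))            # the classic map phase
                .reduceByKey(lambda a, b: a + b))       # group by key + fold = reduce
    counts.saveAsTextFile("hdfs:///out/wordcount")      # placeholder output path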


Dataflow looks like it is just their public version of FlumeJava (both mention PCollections) [0], which indeed is basically just a bunch of MapReduces pipelined into a fancy directed acyclic graph.

Here is the abstract: "MapReduce and similar systems significantly ease the task of writing data-parallel code. However, many real-world computations require a pipeline of MapReduces, and programming and managing such pipelines can be difficult. We present FlumeJava, a Java library that makes it easy to develop, test, and run efficient dataparallel pipelines. At the core of the FlumeJava library are a couple of classes that represent immutable parallel collections, each supporting a modest number of operations for processing them in parallel. Parallel collections and their operations present a simple, high-level, uniform abstraction over different data representations and execution strategies. To enable parallel operations to run efficiently, FlumeJava defers their evaluation, instead internally constructing an execution plan dataflow graph. When the final results of the parallel operations are eventually needed, FlumeJava first optimizes the execution plan, and then executes the optimized operations on appropriate underlying primitives (e.g., MapReduces). The combination of high-level abstractions for parallel data and computation, deferred evaluation and optimization, and efficient parallel primitives yields an easy-to-use system that approaches the efficiency of hand-optimized pipelines. FlumeJava is in active use by hundreds of pipeline developers within Google" [0].

[0]: https://ai.google/research/pubs/pub35650


Yeah, reduce should really have been named groupBy or something. The shuffle-sort to group by key is the important part.


> The same will eventually be said of Docker. I’ve yet to hear a single benefit attributed to Docker that isn’t also true of other VMs,

I bought a Raspberry Pi and using a few commands, installed pre-configured Docker ARM images for 8-9 different media applications that would've taken me days to setup and manage individually. I didn't have to worry about dependencies or compilations. It just worked.
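For anyone curious, each of those apps boils down to a single command along these lines (the image, ports and paths here are just an illustration; jellyfin/jellyfin is one of the media images that publishes ARM builds):

    docker run -d --name jellyfin \
      --restart unless-stopped \
      -p 8096:8096 \
      -v /srv/media:/media:ro \
      -v /srv/jellyfin/config:/config \
      jellyfin/jellyfin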


Properly packaged Debian/raspbian apps are still an “apt install” away. Your use case, which is common, tells me that packaging / distributing may need some love, not that there’s a fundamental difference.

And the convenience does not come free - a random docker is almost as bad as a random executable.


Packaged and configured are two different things. The fun begins when you need several applications, which each have different ideas about configuring services that would otherwise be shared.

Then, as a matter of convenience, you get packages that either bundle everything preconfigured to their liking (i.e. GitLab omnibus) or packages that configure shared services how they need them, with running them shared being unsupported (i.e. FreeIPA, which will configure httpd as it needs and forget having anything else served on the same machine).

Docker provides a way to isolate these, so you can still use the same resources you have to run applications that would not cooperate with each other on a single machine, without having to run separate OS instances in separate VMs.


I mostly use Ubuntu these days, have been for 15 years, and I cannot recall a time when installing a package like FreeIPA disabled a website installed by other packages. Obviously YMMV and I might have just been lucky.


With FreeIPA I meant the server, not the client. Installing is not enough; configuring it into a usable state is needed too (i.e. running freeipa-server-install or freeipa-replica-install).

It is possible to run other services off such a configured httpd, but you need to be careful. When FreeIPA wanted mod_nss, you had to use mod_nss, not mod_ssl (though they have since switched); when FreeIPA wants to use gssproxy, you are going to use gssproxy too. These changes can happen during upgrades, and it is up to you to fix everything after such a change.

The project doesn't recommend running anything else on the same server; you are free to try, though.

The point was that with Docker or another container system, any such problems are irrelevant: it allows you to have separate service instances without having to run separate VMs.


Packaging, in the way it is used in Linux, just needs to be taken out back and shot in the head. If it were any good at all, you wouldn't need so many package managers, package maintainers, repos, etc just to get software. You wouldn't have to worry about updates to your system conflicting with what's already installed, and you wouldn't have to compile things from source or juggle PPAs to get software in a timely manner direct from the developer.

One of the reasons Docker is so popular is that it bypasses that garbage fire.


You don’t need so many package managers. They exist, but they are virtually equivalent these days. Docker is (almost) alone largely as a matter of timing.

And the ppa juggling is somewhat accidental historical complexity, and some trust management issues.

Non-official Docker images in widespread use will, I believe, explode into a security nightmare sooner rather than later.


And nobody cares. Because actually being able to use software, even if it is insecure, trumps not being able to use it.


>> Properly packaged >> packaging / distributing may need some love

I have yet to see a single project which is relatively new and is packaged on Debian (let alone Raspbian). And let's be honest - nobody is genuinely using a Raspberry Pi to run an outdated media server.


Debian unstable (and even testing) has many. Ubuntu PPAs do too.

Raspbian might not, I don’t follow it.


> a random docker is almost as bad as a random executable

You can sandbox a random executable with seccomp; you cannot effectively sandbox a whole container without breaking many things.


you know that Docker (by default) includes a seccomp filter as one of its layers of isolation, right?
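For reference, that default profile can also be swapped out or disabled per container (the image name and profile path below are placeholders):

    # run with a custom seccomp profile
    docker run --security-opt seccomp=/path/to/profile.json myimage
    # disable seccomp filtering entirely (not recommended outside debugging)
    docker run --security-opt seccomp=unconfined myimage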


The problem of installing prepackaged applications was solved decades before Docker was a thing.


Evidently not, or Docker wouldn't exist.

Yes, Docker is a shitty solution to just packaging applications, but it exists because Linux developers keep saying "apt exists. It's solved! What's the problem? Static linking? Why would you want that? Portable binaries? But apt is the only place you need to put your app so you don't need them!"


Even considering Puppet etc., things weren't this simple. The base image depends on the provider, so you'd get variants depending on the kernel and glibc version, especially with OpenVZ-based hosts.

Now, on bare metal servers and premium providers where you're able to specify the boot image, that wouldn't have been a problem. But there are a lot more people dealing with cheap hosting that gives you whatever image they baked last than people doing bare metal deployments, and for the former Docker is an order of magnitude easier to deploy.


Exactly.

"other VMs"? The whole point of Docker is that it's not a VM...


> installed pre-configured Docker ARM images for 8-9 different media applications ... It just worked.

Often the applications are packaged by random people on the Internet and do not receive security updates.

There's plenty of evidence showing how bad the problem is and there's no way around it.

You need the security team of a distribution to backport security fixes into a stable distribution, and a large user community to test them.

Only with this can you run apt-get upgrade without breaking things.


Great, but how does that help recoup the investment the article mentions?

I’m making no comment on the specifics of the Docker or Hadoop ecosystems as I have no skin in either game but history is full of useful tech that didn’t make money.


Cloudera earned $145m last quarter and grew by 37% over the previous quarter. Other Hadoop startups like Databricks are doing well, and Docker has gone from a two-digit-revenue to a three-digit-revenue company from 2017 to 2018. How have billions been wasted when we have successful companies doing well against the toughest competitors ever, i.e. Google, Microsoft and Amazon?


Well, that includes revenue from Hortonworks as well. As revenue rises, losses are also rising, to $85.5M. I feel that in a couple of years they will fold or be bought by some big cloud vendor.


You can't really compare VMs with Docker. Managing containers with Docker + Kubernetes is far easier than managing VMs. Docker might be replaced with something else in the future (i.e. rkt), but the basic concept IMHO is here to stay.


Small aside, but it looks like rkt is not going to be that replacement. The recent disclosure of vulnerabilities in rkt that Red Hat is apparently not going to bother patching indicates they don't currently see it as an active project.


rkt is basically dead and so is rktnetes.

However, Docker internals have basically been replaced with containerd[0] at this point -- the two front runners in the battle to actually run your containers (and power higher level abstraction tools like Docker) are containerd[0] and cri-o[1].

I personally prefer containerd, but there are a lot of people who are obsessed with cri-o (big company backers, from what I can remember), despite the fact that it's chronically behind on features (for example alternate runtimes, runtimeClass support). They're both excellent, though.

Note that there are also other projects like podman[2] that also aim to serve as docker replacements.

Discussion on containerd shim in docker can be found on google group[3] way back in 2016.

[0]: https://github.com/containerd

[1]: https://github.com/cri-o/cri-o

[2]: https://github.com/containers/libpod

[3]: https://groups.google.com/forum/#!topic/docker-dev/zaZFlvIx1...


What about setup and maintenance of the infrastructure that hosts either VMs or containers?

Given a moderate number of physical machines (say, 40), which is easier to install, set up and maintain: a VMware cluster or a Kubernetes cluster?

If someone has any insight, preferably backed by actual experience, it'll be most appreciated.


Assuming an unlimited budget for hardware and licenses, building a team that can deploy and manage VMware on 40 nodes will be much easier. However, a raw VM is not really comparable to what k8s gives you, so you'll also be solving reproducible deployments, load balancing traffic to your cluster, etc. Still, with enough money those can be solved by purchasing more hardware and software, and you'll have an easier time finding people who can maintain that than people who can maintain on-premise Kubernetes.


Many thanks!


VMs are application agnostic, while Kubernetes is a service manager.

If you want to run 200 different services on 40 machines, you may find hand crafted VMs easier to create and forget.

If you want to run 10 services with different levels of replication and redundancy on 40 machines, then Kubernetes will do that for you.


Isn't this the way it normally works, though? A bunch of investments don't work out - those are wasted - and for those that do work out, the people who did the investing get more money back.

Or to put it another way: There must have been some few Hadoop investments that worked out, the same will eventually be true of Docker.


I think what they've noticed is a similar arc, where both products were initially sold as universally required, universally applicable and generational, which justified enough investment to spawn a whole industry.

And instead of fulfilling such dramatic hype, they're both just good tools that are far from universally needed, not objectively superior to all other options, and there's nothing special about them that will keep them from being supplanted by newer tools, which is the norm in this industry even when the tools are good.


Shhhh get out of our ivory tower with your statistics about how the real world works.

/s

But yes, there's no reason to think that the distribution of successful, neutral and failure returns for Docker centric startups won't follow the usual distribution.


> ...but standard VMs allow the use of standard operating systems that solved all the hard problems decades ago, whereas Docker is struggling to solve those problems today.

What are these supposed "hard problems" the author speaks of?


Sandboxed, consistent environments to run code in?


Although with recent adventures in speculative execution, that sandboxing isn't quite as 'solved' a problem as previously thought


All this money should have been spent on developing new, improving existing, or switching to better (operating) systems that solve the resource and communication security problems, instead of creating another inner-platform effect.

I hope WebAssembly goes in this direction, instead of trying to adapt to current programming language paradigms.


>All this money should have been spent on developing new, improving existing or switching to better (operating) systems which solve the resource and communication security problems

But this isn’t the problem Docker is trying to solve. It’s just a problem that Docker needed to solve in order for their product to be useful; this is completely transparent to Docker users. Docker abstracts away a whole bunch of work you’d otherwise have to do to implement repeatable builds, makes those builds widely distributable, and (depending on how you choose to use containers) can also simplify some capacity planning problems.


I thought Docker builds are not generally repeatable, since they often `apt-get update && apt-get install`, which depends on the current state of the external package repositories?

They are definitely not reproducible in the sense of building bit-for-bit identical containers, unless you use Bazel.

That being said, I've found Dockerfiles to be a much more reliable build process than most others (recently struggled to get through LogDevice's cmake-based build.. ugh).
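If bit-for-bit output matters less than simply not drifting between builds, pinning goes a long way; a rough sketch (the digest and version strings are placeholders you would pin yourself):

    # pin the base image by digest, not just by tag
    FROM debian:buster@sha256:<digest>
    RUN apt-get update && \
        apt-get install -y --no-install-recommends \
            curl=<pinned-version> && \
        rm -rf /var/lib/apt/lists/*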


You’re correct, but how reproducible your builds end up being depends on how you use it, and how reproducible you need them to be depends on your use case. Maybe a particular use case wouldn’t fit in very well, maybe it would be better served by using something like Packer, maybe your dependency management requirements mean you should use something like Artifactory. No technology is going to be suitable for everybody’s needs, but Docker provides enough value to enough people that it’s found a place in the market. If Docker dies, I’d imagine it would be because it was replaced by something better, not because people suddenly realized that they weren’t getting any value out of it.


and environments as code


Seriously, who reads this kind of garbage? It's painfully clear to anyone with any experience with Docker that the author hasn't even skimmed the wikipedia page. Reminds me of the idiots blasting out blog posts about how BITCOIN IS THE FUTURE one month, then BITCOIN IS A SCAM the next.


> I’ve yet to hear a single benefit attributed to Docker that isn’t also true of other VMs

Oh boy


Who is this commenter?

Containerised applications are commonly used. At this point it's a proven technology with clear use cases.


The difference being that people actually use Docker


The article just quotes the news about MapR and then asserts the same will happen with Docker. That may be true, but there’s no evidence here.


I suppose I haven't worked in environments scary enough that Docker was a necessity.

As a result I find it difficult understand the hype.


I have come across quite a few articles that mention that for 99% of 'big' data problems, Hadoop and the like are overkill. Simple tools on a beefy machine are sufficient for the task.

Is that the reality of today?

Personally I too feel that distributed computing is overkill for most 'big' data problems.


I've heard people say that a huge portion of the "big" data is garbage; throwing more garbage at something won't make it better, so you'd get much more out of it by cutting it down to a more manageable amount of good data, which you can then easily process in Postgres or whatever.

But even if you "collect everything and sort it out later", in my own personal experience and in what I've read here on HN, you can go a long, long way before you need to reach for the power tools. What most companies call "big data" is typically not that much (in quantity and in velocity). Most companies don't have tens or hundreds of terabytes of data. For example, I'm currently processing timeseries data in Postgres using the timescaledb extension, which makes it perform very well. Still too early to state numbers, but it's looking promising so far, and if their claims are true, then I won't need anything else. We will see :)
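For the curious, the setup is roughly this small (table and column names are just an example of the usual timescaledb pattern):

    CREATE EXTENSION IF NOT EXISTS timescaledb;
    CREATE TABLE metrics (
        time      timestamptz NOT NULL,
        device_id int         NOT NULL,
        value     double precision
    );
    -- turns the plain table into a time-partitioned hypertable
    SELECT create_hypertable('metrics', 'time');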


Yes. Very, very, very few people have data as big as "the internet" and the need for speed (which was the reason Google developed a lot of their distributed tooling) that you get with these frameworks.

And it really only makes sense to have permanent infrastructure for distributed computing if you're constantly using it. Like if you're constantly rebuilding an index of the internet. Which most people aren't.

For occasional reindexing jobs, I've personally had success with Kubernetes. Our customer-facing services were deployed with it, so it was trivial to bump up the number of nodes, schedule a bunch of worker containers, and then roll everything down when the job was complete. No need to learn the ins and outs of Hadoop.
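Concretely, that pattern is more or less a single Job manifest with some parallelism (the image name and the counts are placeholders):

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: reindex
    spec:
      parallelism: 20      # worker pods running at once
      completions: 200     # total work items to finish
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: worker
            image: registry.example.com/reindex-worker:latest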


Which is probably why these Hadoop companies have very few customers and can probably never grow bigger beyond a certain point.


It is very, very, very common for companies to have enough data to need a distributed approach to ETL, and especially common if they are doing any machine learning, which most of them are.

In telcos you have network telemetry data. In supermarkets and retail you have purchase data and often credit card gateway data. In banking and finance you obviously have transaction data.

And with Kubernetes you still need a compute framework. Like, I don't know, the standard in the industry: the Hadoop/Spark framework.


A lot of companies think they need machine learning because the management consultants tell them they do (and hey, we can set it up for you!). And those that are "doing machine learning", often aren't doing it very well and have trouble applying it to the business.

Note I'm not saying all. Just 99%, like the parent comment referred to. That leftover 1% are the companies you can name off the top of your head: ExxonMobil, Target, Chase, Visa, etc.


It really depends on how "big" your data is.

If you have data that is actually big (= doesn't fit onto a machine), then Hadoop is a reasonable candidate. Otherwise you are fine with a lot simpler tools.

This has probably been true for quite some time, and companies are just slowly realizing now that their relevant data isn't actually that big. On the other hand, computing power and storage have still grown in the last years, and less resource-intensive ETL has become more accessible, so the bar for "big" data has been raised quite a bit.


About 5 years ago I worked at a small, business oriented telco. The biggest ETL was processing external CDRs, applying call tariffs, and creating bills and reports. A previous developer had been adamant about using a Hadoop cluster to process this, storing all of it in Cassandra NoSQL.

The idea was interesting, but it didn't quite work out. At some point he left the company and we had to do something about the pipeline, as it was crashing most of the time. We did some calculations, and figured that with the right approach, a good old MySQL+PHP solution would do the trick. And it did.

Having switched jobs myself in the meantime, I'm happy knowing the system is still running, and finding people to maintain it is relatively easy.


Maybe that developer knew something unique about that use case, i.e. data needs were expected to grow, or they had plans to use it for data science (very common in telcos), and you just weren't aware of it.

It's always easy to pass judgement at technology choices but in my experience they are often made with the best intentions based on requirements that not everyone is aware of.


More likely he knew that the pay for a Hadoop / big data specialist was a lot better than for a MySQL specialist, especially 5 years ago.


And the guy's resume probably says

> Implemented big data / real-time Hadoop streaming ETL service processing billions of requests.

I've become very skeptical of anyone who puts a combination of buzzwords and pseudo-numbers in their resume.


Doesn't matter what he "might have known or intended to do", he didn't build a solution which met the specifications the business needed.


You really didn't get my point, did you?

We don't know about the specifications because we don't work there, and since the OP said he wasn't there, he might not know either. And the fact is that requirements regularly change over time.

The point is that it's really easy to judge when you weren't there and aren't privy to all the facts.


But we do know the OP's solution met the specifications, and is still in use, meaning that those specifications haven't changed too much.

And you're damn right I'm going to be judgemental of someone who not only promised the moon, but abandoned the work when the rocket exploded on the pad and someone else had to clean up.


Yes, there are more use cases I didn't know of at the time. For example, optimizing revenue by calculating the optimal routing strategy for outbound calls. But data analysis wasn't a priority. Nor was scaling storage or computations.

I know this is circumstantial, but on average there were 2 developers. The company needed something to just get the bills out. It also needed to be correct, stable, easily expandable for new datasets (requiring all kinds of conversions), testable from unit to e2e, able to integrate with other APIs to retrieve metadata, and .. I forgot a few, probably.


It's not that 99% of big data problems can't or shouldn't be solved by those tools. It's that hype makes everyone think their problem falls under the category of big data when it might not.

Hitting a performance or stability limit with MySQL with an unsophisticated schema/architecture does not mean you have big data. But that's the scenario that's common.


99% of businesses don't experience exponential growth. But hope springs eternal in the human breast.


Yes it's still true. Now with stream processing frameworks and infinite cloud storage, you can have a single big machine churn through TBs of data easily.


My beef with Hadoop, and other big data tools, is that for pretty much any task other than outlier detection, sampling works just as well, and is cheaper and easier to manage and reason about.

Even Google, the king of big data, will sample your hits on Google Analytics if your site gets too much traffic.
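And if the data already sits in something like Postgres, sampling is literally one keyword away (table and column names are made up):

    -- scan roughly 1% of the table's pages instead of all of them
    SELECT avg(amount) FROM orders TABLESAMPLE SYSTEM (1);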


> and number three – well it’s not even worth staying in business because there’s no money to be made,

I guess tell that to the 10s of companies doing PaaS. Or another 10s doing app monitoring / logging. They're successful companies, far from just "breaking even".


Billions are "wasted" on startups in every industry. That's the point. The rich guys gamble and some of their gambles pay off. Most of them fail.


What is a "Hadoop business"? Hadoop is a tool not an industry or product. Can someone explain what the author might be trying to say?


Businesses that based their SaaS data analysis products on Hadoop.

Preconfigured clusters, integration with your existing AWS deployments. All that sort of jazz.


Yup. Tech fashionability is akin to Big4/MBB FUD flavor of the quarter... populism/marketing doesn't a necessity make. Docker, Kubernetes, gulp, Hadoop, mosh, Nix, SmartOS, serverless, cloud, virtualization, [insert tech fashion hype > utility here].

Speaking of Hadoop: my vehicle is parked outside one of the HQs of another top-10 Hadoop startup. It's one of the most expensive, nearly empty buildings in one of the highest-rent areas of the Valley. (Money flushing sound here.)

Fun fact: one of the enterprise Hadoop CTOs is a brony.


I love how the "cloud" is a tech fad in your eyes, as well as technologies like Kubernetes, which is core to Google, and virtualisation, which is core to every VPS/cloud provider in existence.

Be a pretty awful world if everyone took your advice.


Isn’t it obvious? Everybody who doesn’t operate their own datacenters is a mindless technology hipster.

Whenever a new big piece of tech comes out, most of the detractors seem to either a) find a use case that isn’t fit for purpose or b) try to use it without bothering to learn how, and then proceed to say ‘see, it’s not all it’s cracked up to be’.


How is mosh in there? Mosh is something to make ssh more reliable, how would that be a trend?


So I've never used Docker, but it sounds an awful lot like FreeBSD Jails. How is it unique?


Much more user-friendly. Works on all major platforms. Big public repository (Docker Hub). Extensive tooling for everything. Graphical and CLI tools for all kinds of users.


What is even a Docker startup? Docker the tool is a relief in every way & so is Kubernetes.


How about “service mesh”?



