> The money was wasted on hype. The same will eventually be said of Docker. I’ve yet to hear a single benefit attributed to Docker that isn’t also true of other VMs, but standard VMs allow the use of standard operating systems that solved all the hard problems decades ago, whereas Docker is struggling to solve those problems today.
Linux containerization (using the word "docker" for everything isn't right either) is an isolation + sandboxing mechanism, NOT a virtual machine. Even if you talk about things like LXC (orchestrated by LXD), that's basically just the addition of the user namespacing feature. A docker container is not a VM, it is a regular process, isolated with the use of cgroups and namespaces, possibly protected (like any other process) with selinux/apparmor/etc.
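One way to see that a container is just a regular process: every Linux process already belongs to a set of namespaces, visible under `/proc/<pid>/ns/`; a containerized process simply points at different ones. A minimal sketch (returns None on non-Linux systems):

```python
import os

def pid_namespace(pid="self"):
    """Return the PID-namespace id of a process, or None off-Linux.

    On a Linux host, compare this for a shell on the host and a shell
    inside a container: host processes share one id, the containerized
    process gets another. Same kernel, same scheduler, different
    namespace -- no virtual machine involved.
    """
    try:
        return os.readlink(f"/proc/{pid}/ns/pid")  # e.g. 'pid:[4026531836]'
    except OSError:
        return None

print(pid_namespace())
```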
Containerization is almost objectively a better way of running applications -- there's really only one question: do you want your process to be isolated or not? All the other stuff (using Dockerfiles, pulling images, the ease of running languages that require their own interpreters since you package the filesystem) is on top of this basic value proposition.
An easy way to tell that someone doesn't know what they're talking about when speaking about containerization is if they call it a VM (and don't qualify/note that they're being fast and loose with terminology).
All this said -- I do think Docker will die, and it should die, because Docker is no longer the only game in town for reasonably managing (see: podman, crictl) and running containers (see: containerd/cri-o, libcontainer, which turned into runc).
[EDIT] - I want to point out that I do not mean the Docker the company or Docker the project will "die" -- they have done amazing things for the community and development as a whole that will literally go down in history as a paradigm shift. What I should have written was that "docker <x>" where x is "image", "container", "registry", etc should be replaced by "container <x>".
You're right that containers are not VMs, but that's only really relevant as pedantry of technical details.
I think that what the author was trying to say (without really understanding it) was a comparison of containers to VMs as units of software deployment.
I don't think anyone is credibly using containers as a security measure on Linux, because if they think they are, they are in for several large surprises.
Rather, we're seeing the unbundling of software - it used to be that you deployed software to a physical machine with a full OS, then you could deploy it to a virtual machine with a full OS, then you could deploy the process, its dependencies and a minimal OS into a container.
I agree that Docker doesn't have a huge and profitable future ahead of it, because it's providing commodity infrastructure. Rather I think it's interesting to think about what the next level of software deployment decomposition will be, and I'd wager that it's FaaS (ie serverless).
That isn't pedantry; it is an extremely critical point, and the one most people miss when figuring out Docker -- both from a security standpoint (Docker doesn't provide VM-level promises about isolation) and from a resource-management standpoint (containers impose close to no overhead).
It is not uncommon to deploy containers on VMs in the real world...
VMs are literally so hard to do correctly and in a performant fashion that parts of CPU instruction sets and kernel subsystems (KVM) were created to make them easier to run. Containers, in contrast, are literally a few flags and a bunch of in-kernel antics.
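The "few flags" are quite literal: a runtime like runc ultimately passes a handful of CLONE_NEW* bits to clone(2)/unshare(2). A sketch with the flag values from `<linux/sched.h>` (hardcoded here, since Python's stdlib doesn't export them):

```python
# Namespace flag bits from <linux/sched.h>; a container's isolation is
# essentially clone(2) called with these OR'd together.
CLONE_NEWNS     = 0x00020000  # mount namespace
CLONE_NEWCGROUP = 0x02000000  # cgroup namespace
CLONE_NEWUTS    = 0x04000000  # hostname / domain name
CLONE_NEWIPC    = 0x08000000  # System V IPC, POSIX message queues
CLONE_NEWUSER   = 0x10000000  # uid/gid mappings
CLONE_NEWPID    = 0x20000000  # pid numbering
CLONE_NEWNET    = 0x40000000  # network stack

CONTAINER_FLAGS = (CLONE_NEWNS | CLONE_NEWCGROUP | CLONE_NEWUTS |
                   CLONE_NEWIPC | CLONE_NEWUSER | CLONE_NEWPID |
                   CLONE_NEWNET)
print(hex(CONTAINER_FLAGS))  # 0x7e020000
```

Everything else a runtime does (cgroup limits, pivot_root into the image filesystem, seccomp profiles) is layered on top of this one syscall.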
A few people, notably Liz Rice and Jessie Frazelle have given talks on how to make containers from scratch that are very illuminating for those that are interested:
VMs are "hard" because you start with nothing save some very low-level help from the hardware.
What I should have made clearer was that I think containers are both easier and simpler than VMs. There are fewer moving parts (I haven't looked, but I assume less code), and containers are easier to get started with than VMs (set some flags on some syscalls versus make sure you buy the right CPU).
I agree, I think the next evolution of software deployment is definitely heading to sub-program sizes. I do want to point out that we've seen this before, it was called CGI. It's not exactly the same, and things will be better this time (more isolation, better tooling), but if we're doing functions-as-a-service now, we were doing scripts-as-a-service much earlier.
I think the great unbundling of the future is definitely coming and in fact it's already here -- it's just unevenly distributed.
> I agree that Docker doesn't have a huge and profitable future ahead of it, because it's providing commodity infrastructure. Rather I think it's interesting to think about what the next level of software deployment decomposition will be, and I'd wager that it's FaaS (ie serverless).
This was not what I meant to get across -- Docker may have a huge and profitable future ahead of it, but that is only tangentially related to the near-assured continuance of containerization. Docker the company and Docker the project have different goals; they do more than simply offer a way to run containers, and they have for a very long time. My point was that the literal use of the word "docker" should die, because we should just be referring to containerization in the general sense (no matter which lib you're using). It's like the "tissue" vs "Kleenex" debate, in a way.
Once you have a properly set up project going and your entire build process is mostly repeatable, the benefits start becoming more obvious. Yes, you can do all the same things to a certain extent in a VM, but it's really hard to keep that streamlined and up to date. Having a script that sets up your stack in a VM on both Windows and Mac and then runs on Linux is a pretty big maintenance nightmare. A Dockerfile works with a few commands and can be added to your repo.
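As a sketch (the base image, app name, and commands are hypothetical placeholders, not anyone's actual project), the "few commands in your repo" version might look like:

```dockerfile
# Hypothetical Node.js stack; swap in your own base image and deps.
FROM node:10
WORKDIR /app
# Copy manifests first so the dependency layer is cached between builds.
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
CMD ["node", "server.js"]
```

The same `docker build` / `docker run` pair then behaves identically whether the host is a Windows laptop, a Mac, or a Linux CI box.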
It's not without trade-offs, but I think if they can solve the issue of debugging in a better way then we'll really see things solidify on this concept.
This is less true these days (and on Macs that had to use the docker-machine "hack" it was barely ever true, per se) -- rootless containers are on the way thanks to user namespaces. For example, LXC can run fully rootless containers that act more like VMs themselves (as in, they will have systemd as PID 1 inside) -- kernel support, user namespaces, and Filesystem in Userspace (FUSE) make this possible.
IMO one of the biggest benefits of running containers is actually E2E tests. I don't see it done as often as it should be, but it has become drastically easier to run an entire postgres instance for a single local E2E test run. I do this on almost every project I start now: I set up E2E tests that spin up the actual world I expect in production (all the backing services, at the versions they will run at) and interact with my application. This is a huge step forward compared to a huge wiki page titled "how to set up the local test VM" -- you can spin up and shut down these services so fast that you can use them ephemerally (I do) without much worry.
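A minimal sketch of the ephemeral-database pattern (the container name, postgres version, password, and port mapping here are arbitrary choices, not requirements):

```shell
# Start a throwaway postgres pinned to the version production runs.
docker run -d --rm --name e2e-pg \
  -e POSTGRES_PASSWORD=test -p 5432:5432 postgres:11.5

# ... run the E2E suite against localhost:5432 ...

docker stop e2e-pg   # --rm discards the container and all its state
```

Because the container starts in seconds and `--rm` guarantees a clean slate, every test run begins from a known-empty database.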
You absolutely don't need to be a cloud person to use containers. For example I use lxc on my laptop to keep different projects separate and test new software.
Here is a general tutorial if anyone wants to test lxc:
100% agree. Most developers don't even bother testing code that requires DBs and other external services because it's a pain to keep test environments up and running so a lot of the most important and buggy code slips through the cracks.
Then you have the developers who do test but just make heavy use of mocking libs and other tricks which aren't helpful in the long run since the mocks don't actually do anything.
But setting up a docker compose with the whole stack - completely reset each time the tests are run - makes integration tests so much more valuable. And usually it's not a huge deal to set up the plumbing since the docker files are often already created for production.
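A sketch of what that test stack might look like (service names, images, and versions are placeholders for whatever production actually runs):

```yaml
# docker-compose.yml for integration tests -- hypothetical services.
version: "3"
services:
  db:
    image: postgres:11.5
    environment:
      POSTGRES_PASSWORD: test
  cache:
    image: redis:5
  app:
    build: .
    depends_on: [db, cache]
```

Running `docker-compose up -d`, executing the tests, then `docker-compose down -v` tears down containers and volumes, so every run starts from a completely reset stack.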
So it's an innovation in packaging.
Harsh. Docker is a nice way to specify how you want your container to work. Sure release Bocker (A better Docker). But Bocker will just basically be Docker: a simple set of instructions to get a machine running.
That is all we want. A few instructions to get a system up. Devs are sick of setting up machines. Demand is there.
Also, Docker itself is very much trying to be more than what it is right now, Docker swarm, compose, and basically everything Docker as a company does to improve their product offering moves them farther away from the way people are using "docker" in conversation right now.
Basically, when people say "docker image" they should be saying "container image" (for lack of a better term, at least).
> That is all we want. A few instructions to get a system up. Devs are sick of setting up machines. Demand is there.
This is pretty vague -- "setting up machines" can mean a lot of things. I want to make it clear that it's not the container runtime's job to set up a single machine, it's to enable a process to pretend that the machine it has access to is its own. It's a subtle difference but it's worth noting IMO.
If you want to fix setting up your machines, you should be looking at tools like Packer, Container Linux distributions (formerly CoreOS, now Atomic I believe), Nix, Guix, and Linuxkit. Docker does not set up machines, it sets up processes, but happens to bring along a filesystem (amongst other things).
A Linux container is not a machine (virtualized or otherwise); it is an isolated process. I agree that devs are sick of worrying about machine configuration when setting up the required dependencies to run their processes, though. Containerization is here to stay -- it was around before Docker and will be here after.
Docker containers and their related counterparts are abstractions. Useful abstractions don't necessarily equate to a new line of business.
The best attributes of containers (IMO) are packaging and distribution. What businesses and operators need is a repeatable, easy way to deploy applications across their infrastructure. Containers are one piece of that story.
The bigger piece, and IMO, where the business viability is, is the orchestration layer. Containers aren't very useful by themselves, you need a way to get your application online. That's where Kubernetes comes in.
You need to understand large organizations and their challenges to see what layer of the containerization stack holds the most value.
Long-term, I see 'linux containers' as we know them going away. The industry is going to move to something like lightweight, hardware-assisted VM/container hybrids. But no matter what happens at the containerization layer, the orchestration layer is the piece that adds business value to end-users (eg, not AWS or other hosting providers).
But, starting from that kind of "clean slate" state I have to say that if it takes exasperated internet posts, like your helpful comment, to explain why containers are not like VMs and how they differ... well then maybe they aren't different enough from VMs to be such a big new thing.
That goes for many things. Like, I don't get the difference between volleyball and beach volleyball. One is played on the beach. So it's volleyball? Played on the beach?
Feel free to make analogies between two other similar sports, such as Competition Karate and Taekwon-Do.
I mean, we can talk about things that make things similar all day, it doesn't make them the same thing. Is a hamburger a sandwich? Sometimes, the whole is greater than the parts, not everything is perfectly decomposable in life.
Because one is about putting the ball through a basket and the other is about bouncing it over a net?
>> Is a hamburger a sandwich?
But, a very special kind of sandwich.
You rarely see hypetrains for old, established things :)
It's only confusing to people who are familiar just with the popularized forms of computer science terms.
Isolation and sandboxing is virtualization. In a container, the applications seem to have an operating system and machine to themselves.
A single Unix process and its address space is also a kind of virtual machine, creating the illusion that the process has a machine all to itself. Thanks to virtual memory, other processes are not even visible; they are in a different name space of pointers. That concept breaks for multi-process applications: processes are aware of each other through manipulations of shared resources like files. Or effects like not being able to bind a networking port because some other process is tying it up. The next level of virtualization is to have namespaces for resource-related names in the system beyond the address space. As far as just the filesystem space goes, we can virtualize with tools like chroot. A group of applications can have their own global /etc configuration, their own version of the C library in /lib and so on. That's the beginning of "containerization".
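The chroot "fake" can be illustrated with plain path arithmetic -- a sketch only, since the real thing is the kernel enforcing this mapping on every filesystem syscall rather than a userspace helper:

```python
import os.path

def chroot_view(new_root, path):
    """Map an absolute path as seen inside a chroot to the real host path.

    Inside the jail, '/etc/passwd' transparently resolves to
    <new_root>/etc/passwd; the jailed process cannot name anything
    above new_root.
    """
    return os.path.join(new_root, path.lstrip("/"))

print(chroot_view("/srv/jail", "/etc/passwd"))  # /srv/jail/etc/passwd
```

Mount, PID, and network namespaces generalize this same trick to the rest of the system's resource names.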
Yeah given how often people mistakenly call containers VMs or assume they have VM-like properties (like the ability to run a different kernel), this is tripping a lot of developers up, not just lay people.
> Isolation and sandboxing is virtualization. In a container, the applications seem to have an operating system and machine to themselves.
I don't think this is quite true; generally you can isolate and sandbox something without virtualizing anything. In this case, to isolate and sandbox without a loss of functionality, we are virtualizing -- swapping implementations out from underneath an application/process. That is not always the case, though. If you unplug a machine's ethernet cable it is isolated, but you did not virtualize anything; if you deny a process access to a folder on disk, you are isolating without virtualizing, unless you mean that any kernel mediation (like blocking an `fopen()`) is "virtualization".
> A single Unix process and its address space is also a kind of virtual machine, creating the illusion that the process has a machine all to itself. Thanks to virtual memory, other processes are not even visible; they are in a different name space of pointers. That concept breaks for multi-process applications: processes are aware of each other through manipulations of shared resources like files. Or effects like not being able to bind a networking port because some other process is tying it up. The next level of virtualization is to have namespaces for resource-related names in the system beyond the address space. As far as just the filesystem space goes, we can virtualize with tools like chroot. A group of applications can have their own global /etc configuration, their own version of the C library in /lib and so on. That's the beginning of "containerization".
This is a good summary of how containerization works, and the layers of isolation provided by the kernel -- the approach here is isolation by providing a fake, but that does not mean that isolation + sandboxing = virtualization, that just means that these forms of virtualization can be used to provide isolation & sandboxing.
No, not always. Why?
At work I have a few coworkers pushing hard to dockerize (isolate?) everything.
This makes debugging when things go wrong a lot harder.
I see isolation as one of several qualities a process could have, that sometimes is valuable enough to be worth the sacrifice.
Isolation is not some absolute quality that is without significant tradeoffs.
I've found the opposite to be true. If you have proper observability into your isolated services you can more easily pinpoint the origin of a bug.
Containerization of processes definitely increases complexity but if you can take the time to understand VMs then you can (and should, IMO) take the time to understand how containers work as well, they are lighter and simpler (for example, you don't need to build a kernel or make an initrd). I would argue that people who think VMs are simpler are actually being fooled by huge advancements in tooling over the years and the fact that it's become "easy", not that it was ever simple.
I also want to point out that containers should actually make tracking down some bugs easier, but it does so in a counter-intuitive way -- it removes whole classes of bugs from ever occurring. You'll never have two programs clobber some shared folder or resource, you'll never have programs fight over dependencies, or struggle for locally-bound ports if you're running them in containers.
Containerization definitely represents an increase in complexity, but it is well worth the effort, most of the time, granted you understand the tooling.
I don't see it as VMs vs containers.
We have a good devops process to deploy onto our instances, so we rarely have resource clashes you mention (ports/directories) because none of that is ever configured manually. All our infrastructure is derived from 'scripts', so it hasn't been a problem at all.
Aside from python, I see no advantage in containerizing any of our processes at all.
As for debugging, I always forget how infuriating it is, till in the heat of the moment I have to open up a shell into someone's badly made docker image and try to use common tools to help diagnose a problem (ps, nslookup, dig) -- all missing from the wonderful little container.
It's like being on a big navy ship, stranded in the ocean because the engines broke down, but everyone left all the tools back at the base. Yay!
It's not? I didn't mean to pit them against each other in competition. I'm saying that if VMs are worth learning about and taking the time to understand, so are containers. It doesn't have to be zero-sum.
> We have a good devops process to deploy onto our instances, so we rarely have resource clashes you mention (ports/directories) because none of that is ever configured manually. All our infrastructure is derived from 'scripts', so it hasn't been a problem at all.
It seems like it was a class of problems that you have fixed with "good devops process". I'd argue that it probably was a problem at one point, and you improved your devops process to make sure it wasn't.
> Aside from python, I see no advantage in containerizing any of our processes at all.
Well, I don't know your infrastructure, so I can't comment on that. I doubt that python is the only thing you run that could benefit from containerization (which, again, means limiting access to system resources through namespaces and cgroups), but if you say so then I have no choice but to believe that it's the case.
> As for debugging, I always forget how infuriating it is, till in the heat of the moment I have to open up a shell into someone's badly made docker image and try to use common tools to help diagnose a problem (ps, nslookup, dig, all) all missing from the wonderful little container.
Sounds like you could use some more of that "good devops process" you had when you set up the deploy machinery.
Also, the fact that all of that stuff is missing from the container is actually beneficial from a security point of view -- the inconvenience you're experiencing is the same inconvenience an intruder would hit first, before breaking out of the container (assuming they had the skill to do that). This means you have another chance to catch them downloading and/or running `ps`/`nslookup`/`bash` or whatever tooling, and flag the suspicious behavior. Whether you're in a VM or not, containers are another line of defense, and that's almost certainly a good thing.
It certainly was but we fixed it and it's not a problem anymore.
> Sounds like you could use some more of that "good devops process" you had when you set up the deploy machinery.
Yeah, there are people within my group that want to 'modernize' things and put them into containers willy nilly for no real reason.
We have already solved all the difficult problems that containers are supposed to 'save' us from. Many of the proposed containers would just be a single statically linked binary with a config file.
FYI, our stuff is hosted internally, so security considerations are not such a big deal.
To hear these container advocates, you'd think that till they came around no one ever managed to use linux.
I'm fully expecting Linux userland tools to go away, to be replaced by custom 'distributions' with only a kernel and a Docker API soon.
They're already here!
- CoreOS Container Linux (now owned by Redhat)
- Kubic (more focused on running Kubernetes, but same idea)
There are also tools like Linuxkit which focus on helping you actually build images that run the containers you want and nothing else @ startup, which is pretty cool I think.
MicroVMs start faster and provide better isolation.
Meanwhile, none of this is relevant unless you're building your own cloud platform, which is a huge waste of time for most companies.
MicroVMs, containers, VMs, zones and bare metal are places to execute code. Serverless makes all those distinctions irrelevant.
Sorry if you spent 2015 getting really into Docker. You bet on the wrong horse. It's OK, this happens in tech.
Edit: if it's unclear, I don't mean Docker itself is the wrong horse, I mean containerisation tools per se are the wrong horse - and a bad place to invest your time unless you work for a cloud provider
Docker is far from betting on the wrong horse. I can build a docker container and deploy/orchestrate it however I wish - via a docker runtime, kubernetes (which can vary in its underlying implementation), microVMs, VMs, or bare metal.
But then the containers don't need to be Docker either. There are many options available there too.
Either way, there are many options available. And the horse(s) are still in the running.
As for "serverless" vs containers/other, if you can provision accordingly, containers/other can be a better option at scale. Horses for courses, I guess.
Again, most people don't need to build their own cloud platform.
Being able to reuse Dockerfiles is nice, but you've reinvented CFEngine for the eighth time. That's great.
The point is Docker is yet another standard way to build boxes and contain them. Serverless / FaaS platforms obviate that need.
* obviously an isolated kernel is better than a shared one, but the point of this post (and this comment) is that Docker vs VM vs MicroVMs vs zones etc is irrelevant for companies who are not cloud providers
It does isolate a process's view of the filesystem, it does isolate a process's view of the PID namespace, and that is valuable. Let's say it's a spectrum:
raw processes -------- processes with a certain user ---------- namespace+cgroup isolated processes ---------------- VMs
Docker is certainly an improvement for processes that you don't need fully contained but do want somewhat isolated. For isolation in every sense, you want a VM (a lightweight one if you can get it, i.e. some stripped-down qemu).
> MicroVMs start faster and provide better isolation.
Agreed on isolation -- VMs are hands down better at isolating, but I'm a bit skeptical about starting faster.
> Meanwhile, none of this is relevant unless you're building your own cloud platform, which is a huge waste of time for most companies.
Again, this is exactly why it's relevant, 99% of people aren't building their own cloud platform, so they don't need the full isolation of VMs -- most of the time they're just trying to prevent program A that their devs wrote from clobbering program B that their devs also wrote when they both run on the same machine, and making it easier to deploy the dependencies that come with each.
> MicroVMs, containers, VMs, zones and bare metal are places to execute code. Serverless makes all those distinctions irrelevant.
OK I don't even really know what this means, you know serverless runs on MicroVMs right? and most of the time it's actually containers in MicroVMs? MicroVMs are just stripped down versions of regular VMs, and no one is in a zone unless they're running Solaris.
The distinction between these things is still very important, unless you mean that the future is everyone just deploying functions for their applications? But even if you mean that, cold start is basically the first stumbling block you see and it literally exists because of the distinction between how these technologies work (and how fast they can be started on demand, with how much isolation).
> Sorry if you spent 2015 getting really into Docker. You bet on the wrong horse. It's OK, this happens in tech.
Is there anyone that spent 2015 really getting into Docker and isn't better for it now? The technologies that have sprung out of this part of computing are very valuable to know and are getting more valuable, not less. You don't have to install a VM to run an isolated postgres instance on your dev machine because containerization exists -- if you're still doing this you should probably look into updating your tooling.
Also, don't forget that containerization is how some of the richest and supposedly best (due to their ability to spend money on engineers) companies in the world have been handling deployment for nearly a decade -- 2015 is late for realizing containerization is a good thing, not early.
Indeed. And since we're not building our own cloud environment, because that's a waste of resources for most companies, we will share an environment with possibly malicious code so therefore need isolation.
> unless you mean that the future is everyone just deploying functions for their applications?
Yes, that is exactly what I mean.
> You know serverless runs on MicroVMs right?
Yes, that's why I mentioned them
> and most of the time it's actually containers in MicroVMs?
I doubt this - AWS's performance documentation focuses on MicroVMs as an alternative to containers, not an addition. Which makes sense, as containers do less than MicroVMs.
But hey, even if the AWS MicroVM documentation is wrong, it doesn't matter. I am not building a cloud platform. I do not care.
You're right about spin up time. Open a socket and let your apps terminate with it open for a greater chance of reuse. As a FaaS user that's your entire concern with your execution environment.
> 99% of people aren't building their own cloud platform, so they don't need the full isolation of VMs
All cloud environments must provide isolation between customers, hence VMs / MicroVMs. Customers adding Docker on top of that add a huge administrative overhead that duplicates the features of their cloud provider for little benefit.
> Is there anyone that spent 2015 really getting into Docker and isn't better for it now?
Every single person whose product is not Internet infrastructure, that wrote or configured their own unnecessary custom LXC or Docker and VM environment because 'Docker changes everything'.
If a company's product is machine learning for detecting cancer and their ops person has a custom Docker/kubernetes environment they're misusing their employer for their own technical interest
>2015 is late for realizing containerization is a good thing, not early.
It is indeed. It's just that then Docker hype was at its maximum.
OK, I'd like to note that it's also the past -- CGI was (and in some dark corners still is) a thing.
> Every single person whose product is not Internet infrastructure, that wrote or configured their own unnecessary custom LXC or Docker and VM environment because 'Docker changes everything'.
Yeah but those people now have way easier to run local environments?
> If a company's product is machine learning for detecting cancer and their ops person has a custom Docker/kubernetes environment they're misusing their employer for their own technical interest
???? If your company's product is machine learning, and developers who must work on that product need to set up their environment on their local machines, docker is easier to get started with than VMs, runs faster, consumes less resources. While it might be arguable that it's easier, there are literal money savings to be had by running a docker container instead of a full VM.
Kubernetes has much more complexity and many more tradeoffs involved so I can see that being a much heavier decision.
What you're saying is that overzealous ops people who are looking to pad their own resumes should not be allowed to run amuck, and I agree with that, but docker is not the poster boy for engineering largess -- and I'd argue it never was. Companies and research groups/smaller distributions have been using containerization very productively for a long time.
Yep. Tech does that - think about centralisation / distribution every few years, maybe it'll cycle back to people caring about their own containment tech in future. But not right now.
> Yeah but those people now have way easier to run local environments?
OK, so they wasted their time on their awful custom Docker/k8s thing that runs on top of EC2 anyway, and they have a slightly better way to spin up dev environments?
The rest of the conversation is about dev environments, bare metal, VMs and containers all have their place and I mostly agree with you (obviously containers are only useful when there's a Linux kernel on that desktop, Windows and Mac are bare metal or virtualising for the most part).
> What you're saying is that overzealous ops people who are looking to pad their own resumes should not be allowed to run amuck, and I agree with that, but docker is not the poster boy for engineering largess -- and I'd argue it never was.
You understand my point perfectly. I believe that docker is precisely the poster boy for engineering largess, but this is based on my own experiences (talking to a lot of young engineers in the startup world who love wasting investor money on ops) and it seems reasonable that you have had different experiences.
I think we have a good understanding of where we each come from and can end it here. Thanks for being civil.
However, I would add that for my own personal use, it's invaluable for development work. All that work that you do _before_ your CI or deployment.
1) When I'm working with a collection of tools that I need but are a complete mess with lots of state (think: compiler tools, LaTeX, things like that), then docker image build with its incremental way of running each command piece by piece, and saving the state after each RUN, is actually a life saver. You follow the steps of some instructions, and of course, as usual, there's one extra step not documented in the manual, so you add that to your Dockerfile. You make a mistake, no big deal, just change the command, the bad state is discarded, and you get to try again. You don't have to run the whole thing all over again. And it's instantaneous.
2) When I have to work with a client's codebase, as a consultant, you'd be surprised how many projects do not have a reproducible build, with Docker or anything else. So I end up building my own Dockerfile. The number of times I've heard "but you just have to run this setup script once" -- well, those scripts never work (why would they? nobody runs them anymore). Especially when it begins with `npm` or `pip` -- almost guaranteed to fail catastrophically, with some g++ compile error, or a segfault, or just a backtrace that means nothing. For example, I recently had to run an `npm` install command and it failed with `npm ERR! write after end`. I re-ran the container again, and again once more, and then it succeeded (https://gist.github.com/chrisdone/ea6e4ba3d8bf2d02f491b4a17f...). npm has a race condition (https://github.com/npm/npm/issues/19989; fixed in the latest version). I wouldn't have been able to confidently share with my client this situation unless I had that reproducibility.
3) It's trivial to share my Dockerfile with anyone else and they can run the same build process. I don't have to share an opaque VM that's hundreds of megs and decide where to put it and how long I want to keep it there, etc.
4) It's a small one; but speed. Spinning up or resuming a VirtualBox machine is just slow. I can run docker containers like scripts, there isn't the same overhead.
5) Popularity is a blessing; the fact that I _can_ share my Dockerfile with someone is a network effect. Like using Git, instead of e.g. darcs or bzr.
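Point 1's incremental workflow can be sketched as a minimal Dockerfile (the package names here are illustrative, not from the original comment):

```dockerfile
FROM debian:stable-slim

# Each RUN creates a cached layer; editing a later step only
# re-executes from that step onward, so earlier work is never redone.
RUN apt-get update && \
    apt-get install -y --no-install-recommends texlive-latex-base
RUN apt-get install -y --no-install-recommends latexmk

# The "one extra undocumented step" goes here: if it fails, fix the
# line and rebuild -- the layers above are reused instantly.
RUN latexmk --version
```

Because the cache key is the instruction text plus the prior layer, appending or editing the last step costs only that step's runtime.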
By the way, you can also do nice container management with systemd and git. There's nothing inherently technologically _new_ about Docker; it's the workflow and network effects; it lets me treat a system's state like a Git repo.
Nix has similar advantages, but I see Docker as one small layer above Nix.
Docker provides an (IMHO pretty buggy) isolation layer that lies between "keeping things that need to be kept separate in separate folders" and "keeping things that need to be kept separate in separate virtual machines".
I actually don't have the need for the level of isolation below VM and above folder very often. IMHO this level only really makes sense when containing and deploying somewhat badly written applications that have weirdly specific, non-standard system level dependencies (e.g. oracle) that you don't want polluting other applications' dependencies.
I've compiled and installed postgres in separate folders lots of times (super easy) and I've lost count of the number of times people have said "why don't you just dockerize that?" as if that was simpler and/or necessary in some way. That's the effect of "docker hype" talking.
The primary use case for Docker, as far as I can see, is simplifying deployment across varying environments. Variations can happen for many reasons. Sometimes you have clusters of various sizes in production. Sometimes the environment is a developer laptop. And so on.
Or if you have, y'know, a really simple script.
...the kind which also runs inside most semi-complex Docker containers anyway.
I've spent more of my life and torn out more hair dealing with obscure docker bugs than I have converting scripts from one flavor of linux to another.
It's great that you compile postgres but I just want to run it in a clean and portable way, along with several other programs, and without learning new workflows for each one. Docker containers give people more options to package and run software in a simple standardized process while offloading the tedious system details that don't matter. That's progress.
As an ops person myself, Docker has saved me lots of time. Developers can run their containers locally, then hand them over to me to stand up. As we move to hosted services, I don't even need to maintain a server. My role is shifting from spending lots of time on Ansible and monitoring servers to helping look at code and spending more time investigating weird bugs beyond the developers' capacity.
I was an early Hadoop adopter as well... And I agree with people's sentiment here -- it was a tool looking for a problem (outside its specific use case). I used it for its intended purpose, and I also bent it into service as a web crawler. It actually kind of worked in that regard, but it's not the right usage. It might be able to expand into new use cases though.
Docker solves (again) a real problem in the industry that has existed for decades... and the solution keeps swinging back and forth. Nowadays we train developers, not systems engineers (I've been trying to hire a systems engineer for almost a year with nearly no bites, while developer positions get 3 good candidates worth interviewing in 2 weeks or less). This means we have lots of available developers and not enough ops people. Containers help shift the burden to fit this dynamic -- they simplify the process of getting a dev's application to work in isolation. This means 1 ops guy can support a dozen developers and 30 apps on one server relatively easily, compared to before. It shifts the responsibility for the developer's runtime environment to the developer... We can still step in to help, but now that environment is codified in git.
I've been an ops guy for a decade and unlike many of my colleagues I love Docker; it's let me focus on more important things.
I can give my coworker a docker image and it mostly "just work" without failing because she happens to be running a slightly different version of Ubuntu with different system libraries present.
> I can give my coworker a docker image and it mostly "just work" without failing because she happens to be running a slightly different version of Ubuntu with different system libraries present.
"I can give my coworker a VM image and it mostly "just work" without failing because she happens to be running a slightly different version of Ubuntu with different system libraries present."
"I can give my coworker a full system container image and it mostly "just work" without failing because she happens to be running a slightly different version of Ubuntu with different system libraries present."
> applications that have weirdly specific, non-standard system level dependencies
Spot on. 99% of software the world needs can be written against libraries provided by OSes. And then packaged properly.
A slightly smarter .tar.gz would have solved the problem just as well.
A container is vastly more powerful for running an application than a tar file.
You can often run daemons as different users and set appropriate file permissions. You can add ENV variables to your start up scripts or configuration files. Volumes are mounted by the system (and you set appropriate access rights again). Monitoring and restarting services is managed by your init system (and probably some external monitoring, because sometimes physical hosts go nuts). Depending on your environment you can just produce debs, rpms, or some custom format for packaging/distribution.
Yes, sometimes you still want docker or even a real VM, and there are good reasons for that - I totally agree. But often it is not necessary. I'm often under the impression that some people forget that the currently hyped and cool tech is not always and under every circumstance the right solution to a given type of problem. But that's not an issue with docker alone...
That sounds exactly like creating a Dockerfile. The difference is that your script has to work any number of times on an endless number of system configurations. The Dockerfile has to work once on one system which is a much easier target to hit. The "any number of times on an endless number of system configurations" is a problem taken care of by the Docker team.
before it was just a mess.
and it also isn't that much older than docker.
Longer answer here https://thenewstack.io/docker-based-dynamic-tooling-a-freque...
My point was that Docker purports to solve the sandboxing and security problems.
In reality, this is something that 90% of people who use Docker don't give a shit about. For the vast majority Docker is just a nice and easy-to-use packaging format.
The sad part is that
a) Docker failed at security.
b) In trying to solve the security problem Docker ended up with a pretty crufty (from a technical point of view) packaging format.
Maybe we need to start from scratch, listen to the devs this time and build something they actually want.
Says who? The article I linked to you says nothing about security.
>Docker failed at security.
If somebody thinks security is the strong feature of Docker he/she is misinformed.
>For the vast majority Docker is just a nice and easy-to-use packaging format
For the vast majority of who? Developers? Sys admins? PMs?
The big advantage of docker is the self-contained environment for CI builds.
It's called "OS package" ;) and can provide more strict sandboxing using a systemd unit file: unit files provide seccomp, cgroups and more.
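For the curious, a hedged sketch of such a unit file; the service name and binary path are made up, but the directives are real systemd sandboxing options:

```ini
# /etc/systemd/system/myapp.service -- illustrative unit name
[Unit]
Description=Example sandboxed service

[Service]
ExecStart=/usr/bin/myapp
DynamicUser=yes                    # transient unprivileged user
ProtectSystem=strict               # read-only /usr, /etc, ...
PrivateTmp=yes                     # private /tmp namespace
NoNewPrivileges=yes
SystemCallFilter=@system-service   # seccomp syscall allow-list
MemoryMax=512M                     # cgroup memory limit

[Install]
WantedBy=multi-user.target
```

This gets you seccomp filtering, cgroup limits, and filesystem isolation from a plain OS package, no container image required.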
1) is not a containerisation problem. It's a team problem. I can jam a load of npm and pip installs into a shell install script. Maybe even delete /usr/ for the hell of it. Because the script isn't isolated from the OS, I can cause more damage.
This problem is actually solved by doing code reviews properly and team discussions.
2) errr no. Containers != infrastructure. If you want to deploy on bare metal, you can.
Yes, systemd unit files are containers, just like Docker.
Nix tries to solve this, but it isn't there just yet.
Use the same OS and similar hardware for development and production.
Also means developers can work in whatever environment they want, but the result will be reproducible (almost) anywhere.
This baseless assertion is patently wrong on so many levels. Building computing clusters on COTS hardware is a very mundane problem. Running processing jobs on data shards is a very mundane problem. Scaling COTS clusters transparently is a very mundane problem.
Many people use/used Hadoop for problems that did not warrant the overhead and complexity that comes with Hadoop. I've seen it countless times with my own eyes that people pre-emptively use tools like Hadoop and Spark because of a chance that they will hit a massive scale in the future.
This happens in both startups and enterprises alike: people like to think they have big problems too often.
Worse still, it didn’t even use HDFS and we eventually got sick of the crappy embedded Zookeeper/Kafka setup.
A.k.a. resume-driven development. Having Hadoop on your CV looks sexier than awk.
I just sat there thinking I could probably run what they did on my phone.
And just to be clear - it wasn't a PoC or a demo.
No, it still remains astonishingly wrong. Even container orchestration platforms are being adapted to provide the same service that Hadoop has been providing for years, and no one in their right mind would claim that running processing jobs on the cloud is a problem that almost no one has.
- Fits on one computer (most of the market)
- Fits on several computers (most of the rest)
- Requires a significant cluster of machines (50+ to store it)
Hadoop only really solves the last one. It has huge overheads in terms of speed and in terms of resources and headcount to run it properly, so it only makes sense at a particular scale. It's like a mainframe – most companies shouldn't buy one.
If you add to this the fact that Hadoop was about batch processing, and its "realtime" capabilities were poor, there really aren't that many potential customers, and many of the potential customers would rather run it in-house, or build their own system.
There's one category above that, which is "Fits in memory" and that is a huge chunk of the market. I've seen first hand people getting way too cute and complicated planning for scale, and then it works out that they don't even have more than a couple of GB of data.
Unless you're storing media, or you are truly "web-scale", your business data will very likely fit in 512GB.
I've heard of a company refusing to purchase an external drive for an employee so they could process a handful of ~50GB datasets on their MacBook Air – instead forcing them to use "the cloud" or constantly download and backup datasets.
I've heard of companies doing extensive work to set up Hadoop to process a few GB.
Roughly I'd suggest that "fits on my laptop" is <1TB, "fits in memory" is < 1TB, "fits on one computer" is < 10TB, "fits on a small cluster" is <100TB, and "might be worth Hadoop" is >100TB. I could be too low on these though.
From Heroku's site, Heroku's 4GB database plan goes for $50 per instance per month, while the 8GB plan goes for $200 per instance per month.
Therefore, it isn't a question of whether it makes financial sense (it does) but how long the startup plans to operate to recover its investment.
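A back-of-the-envelope version of that break-even calculation, using the Heroku prices quoted above (the one-time drive cost is a hypothetical assumption, not from the thread):

```python
# Hosted plan prices quoted above, in $/instance/month.
hosted_4gb_per_month = 50
hosted_8gb_per_month = 200

# Hypothetical one-time cost of an external drive.
external_drive_once = 200

# Months of the 8GB plan needed to exceed the one-time hardware cost.
break_even_months = external_drive_once / hosted_8gb_per_month
print(break_even_months)  # 1.0 -- the drive pays for itself in a month
```

Under these assumptions the hardware wins after the first month, which is the commenter's point about recovery horizons.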
> I've heard of a company refusing to purchase an external drive for an employee so they could process a handful of ~50GB datasets on their MacBook Air – instead forcing them to use "the cloud" or constantly download and backup datasets.
I find it rather strange how someone believes that it's a decent idea to conduct a company's data analysis work on what an employee manages to fit on an external HD, as it creates a whole lot of hard problems, both legal and technical. I mean, how do you ensure the data's provenance is tracked and other data analysts can access the data? Who in their right mind would put themselves in a situation where a minor lapse or misfortune (losing the HD or getting it stolen) could put the company at risk?
> Roughly I'd suggest that "fits on my laptop" is <1TB, "fits in memory" is < 1TB, "fits on one computer" is < 10TB, "fits on a small cluster" is <100TB, and "might be worth Hadoop" is >100TB. I could be too low on these though.
That's a rather naive and misinformed take on Hadoop. Hadoop might be conflated with big data, but it's actually a distributed system designed to reliably process data shards without having to incur a penalty to move data around. It makes absolutely no sense to base your assertion on data volumes alone. What matters is whether the performance increase justifies setting up a Hadoop cluster with the resources available to a company.
Now making (big)money with its ecosystem is another question.
Something that can handle hundreds of terabytes on hundreds of machines and provides useful tools on top of the whole thing (Spark, Hive, etc)?
BTW, "hundreds of terabytes on hundreds of machines" isn't interesting territory any more. Most people's needs are far smaller, so HDFS isn't much help. Those who need more generally need much more, so HDFS isn't much help again. Richer semantics are nice either way. Imagine thousands of machines with dozens of terabytes each and you might start to see the problems with HDFS's design (though you'll still be far short of the domain I work in).
And I am curious about possible alternatives: open source, about 100-200 machines, with good support for analytics and SQlish systems.
So there should be some money there.
Nope. K8s is a cluster operating system that happens to fit very well with the microservices architectural model. Packer is a way to create classic VM images (this one is similar-ish to Docker, but only if you care about the 10,000-foot, non-technical-at-all picture). Ansible is an infrastructure-as-code tool: you deal with mostly classic infra components and compositions of them as code. Putting them all in the same bag is like saying that all programming languages solve the same problem. It's only true if we cut the conversation down to a level where we consider all digital devices the exact same thing.
I'm genuinely curious: if you want to search over big data, which should be a pretty common procedure these days, what alternatives are there to a distributed file system? A DFS seems very complex to me, and it is not clear to me what alternative system designs a DFS will outperform. Is a DFS the only solution to big data?
Relational DBs do break down at a certain scale. What system do you turn to next? Nosql? Will that scale infinitely? Will any system scale infinitely?
My guess is that almost all programming jobs are in fields that produce no more data than a GB or two a month.
I'm happy to be proven wrong, but I would guess that there are far more companies making project management software, time tracking apps, invoicing software, etc. than there are facebooks, googles or reddits obsessively logging every user mouse twitch.
And that's data that's much better sitting in a nice, normal, relational database.
Yes, it definitely seems with market leadership comes big data. Seems to me big data is highly relevant. More relevant now than ever.
I.e., if you are generating reports or running aggregations over a large amount of data, you definitely need some parallelism, and Postgres isn't designed to handle these loads (certainly not petabytes). Even aggregating hundreds of GB probably requires (or at least is more cost-effective using) multiple machines.
Now Hadoop may not be a particularly efficient solution unless you need 100's of machines. But there is a limit to what a non-parallel single machine database can do. There are other solutions in-between.
And you really don't have to be twitter or google to handle significant amount of data these days. People are recording much more data in the hope of generating new insights and do need tools to process that data.
Aggregating 100s of GB isn't much of a problem for PG these days. Yes, you can be faster - obviously - but it works quite well. And the price for separate systems (duplicated infrastructure, duplicated data, out-of-sync systems, ...) is noticeable as well.
But yea, for many petabytes of data you either have to go to an entirely different system, or use something like Citus.
Disclaimer: I work on PG, and I used to work for Citus. So I'm definitely biased.
But I honestly don't think hundreds of users each querying 100s of GBs is all that common.
Even Postgres got this right recently with the introduction of the BRIN index, which is a lot more lightweight.
Look at Netezza, Oracle Exadata, and (disclaimer: I work on this) SQream DB, which can absolutely handle hundreds of terabytes without too much fuss.
First, NLP based search can be executed on top of any engine (APIs are very handy), relational, kv, graph, filesystem .. so that part is totally irrelevant.
Assuming "big data" in this context is still relational data, then any of those systems would suffice, within their own particular tradeoffs and features.
If you're talking about taking some questions and getting graphs from them, ThoughtSpot does a good job.
> if you want to search over big data, which should be a pretty common procedure these days
It's not a common procedure by any stretch because the vast majority of datasets aren't really "big".
Then again it's called venture capital for a reason so this isn't exactly unexpected. The question should really be more about the scale and hype that was involved.
Btw. there are many very successful startups in the big data space that understood the limitations of Hadoop and addressed almost all of its shortcomings. A good example would be Snowflake Computing.
How are they then querying over big data these days?
We don't know, do we? Or did they open-source their search engine?
By using Hadoop people are trying to not reinvent the big data wheel, partly because it's a motherfucker of a problem to have to solve and partly because they want to solve the business problem, not the technical one. I don't see how that is in any way worthy of being frowned upon.
Map -> map, filter, flatmap, etc
Reduce -> reduce, joins, folds, group by, etc
Those other concepts were always expressible as map and reduce, of course, just with a bunch of annoying repetitive work
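A small sketch of that claim: filter and group-by written using only map and reduce (the data here is made up for illustration):

```python
from functools import reduce

data = [("a", 1), ("b", 2), ("a", 3)]

# filter expressed as map-to-singleton-or-empty-list + reduce-concat
kept = reduce(lambda acc, x: acc + x,
              map(lambda kv: [kv] if kv[1] > 1 else [], data),
              [])

# group-by expressed as a reduce into a dict of lists
def merge(acc, kv):
    k, v = kv
    acc.setdefault(k, []).append(v)
    return acc

groups = reduce(merge, data, {})

print(kept)    # [('b', 2), ('a', 3)]
print(groups)  # {'a': [1, 3], 'b': [2]}
```

This is exactly the "annoying repetitive work" the comment mentions: the richer operators are just these patterns packaged with names.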
Here is the abstract: "MapReduce and similar systems significantly ease the task of writing data-parallel code. However, many real-world computations require a pipeline of MapReduces, and programming and managing such pipelines can be difficult. We present FlumeJava, a Java library that makes it easy to develop, test, and run efficient data-parallel pipelines. At the core of the FlumeJava library are a couple of classes that represent immutable parallel collections, each supporting a modest number of operations for processing them in parallel. Parallel collections and their operations present a simple, high-level, uniform abstraction over different data representations and execution strategies. To enable parallel operations to run efficiently, FlumeJava defers their evaluation, instead internally constructing an execution plan dataflow graph. When the final results of the parallel operations are eventually needed, FlumeJava first optimizes the execution plan, and then executes the optimized operations on appropriate underlying primitives (e.g., MapReduces). The combination of high-level abstractions for parallel data and computation, deferred evaluation and optimization, and efficient parallel primitives yields an easy-to-use system that approaches the efficiency of hand-optimized pipelines. FlumeJava is in active use by hundreds of pipeline developers within Google".
I bought a Raspberry Pi and using a few commands, installed pre-configured Docker ARM images for 8-9 different media applications that would've taken me days to setup and manage individually. I didn't have to worry about dependencies or compilations. It just worked.
And the convenience does not come free - a random Docker image is almost as bad as a random executable.
Then, as a matter of convenience, you get packages that either bundle everything preconfigured to their liking (e.g. GitLab Omnibus) or packages that configure shared services how they need them, where running those services shared is not supported (e.g. FreeIPA, which will configure httpd as it needs and forget having anything else served on the same machine).
Docker provides a way to isolate these, so you can still use the same resources you have to run applications, that would not cooperate with each other on a single machine, without having to run separate OS instances in separate VMs.
It is possible to run other services on such a configured httpd, but you need to be careful. When FreeIPA wanted mod_nss, you had to use mod_nss, not mod_ssl (though they've since switched); when FreeIPA wants to use gssproxy, you are going to use gssproxy too. These changes can happen during upgrades, and it is up to you to fix everything after such a change.
The project doesn't recommend running anything else on the same server; you are free to try though.
The point was that with Docker or another container system, any such problems are irrelevant, and it allows you to have separate service instances without having to run separate VMs.
One of the reasons Docker is so popular is that it bypasses that garbage fire.
And the ppa juggling is somewhat accidental historical complexity, and some trust management issues.
Non-official dockers in widespread use will, I believe, explode as a security nightmare sooner than later.
I have yet to see a single project which is relatively new and is packaged on Debian (let alone Raspbian). And let's be honest - nobody is genuinely using a Raspberry Pi to run an outdated media server.
Raspbian might not, I don’t follow it.
You can sandbox a random executable with seccomp, you cannot effectively sandbox a whole container without breaking many things.
Yes, Docker is a shitty solution to just packaging applications, but it exists because Linux developers keep saying "apt exists. It's solved! What's the problem? Static linking? Why would you want that? Portable binaries? But apt is the only place you need to put your app so you don't need them!"
Now, on bare-metal servers and premium providers where you're able to specify the boot image, that wouldn't have been a problem. But there are a lot more people dealing with cheap hosting that gives you whatever image they baked last than people doing bare-metal deployments, and for the former Docker is an order of magnitude easier to deploy.
"other VMs"? The whole point of Docker is that it's not a VM...
Often the applications are packaged by random people on the Internet and do not receive security updates.
There's plenty of evidence showing how bad the problem is and there's no way around it.
You need the security team of a distribution to backport security fixes into a stable distribution and a large user community test them.
Only with this you can run apt-get upgrade without breaking things.
I’m making no comment on the specifics of the Docker or Hadoop ecosystems as I have no skin in either game but history is full of useful tech that didn’t make money.
However, docker internals have basically been replaced with containerd in this point -- the two front runners in the battle to actually run your containers (and power higher level abstraction tools like docker) are containerd and cri-o.
I personally prefer containerd, but there are a lot of people who are obsessed with cri-o (big company backers, from what I can remember), despite the fact that it's chronically behind on features (for example alternate runtimes, runtimeClass support), but they're both excellent.
Note that there are also other projects like podman that also aim to serve as docker replacements.
Discussion on containerd shim in docker can be found on google group way back in 2016.
Given a moderate number of physical machines (say, 40), which is easier to install, set up, and maintain: a VMware cluster or a Kubernetes cluster?
If someone has any insight, preferably backed by actual experience, it'll be most appreciated.
If you want to run 200 different services on 40 machines, you may find hand crafted VMs easier to create and forget.
If you want to run 10 services with different levels of replication and redundancy on 40 machines, then Kubernetes will do that for you.
Or to put it another way: There must have been some few Hadoop investments that worked out, the same will eventually be true of Docker.
And instead of fulfilling such dramatic hype they're both just good tools that are far from universally needed, not objectively superior to all other options, and there's nothing special about them that will keep them from being supplanted by newer tools, which is the norm in the industry even for good tools.
But yes, there's no reason to think that the distribution of successful, neutral and failure returns for Docker centric startups won't follow the usual distribution.
What are these supposed "hard problems" the author speaks of?
I hope WebAssembly goes in this direction, instead of trying to adapt to current programming language paradigms.
But this isn’t the problem Docker is trying to solve. It’s just a problem that Docker needed to solve in order for their product to be useful, this is completely transparent to Docker users. Docker abstracts away a whole bunch of work you’d otherwise have to do to implement repeatable builds, it makes those builds widely distributable, and (depending on how you choose to use containers) can also simplify some capacity planning problems.
They are definitely not reproducible in the sense of building bit-for-bit identical containers, unless you use Bazel.
That being said, I've found Dockerfiles to be a much more reliable build process than most others (recently struggled to get through LogDevice's cmake-based build.. ugh).
Containerised applications are commonly used. At this point it's a proven technology with clear use cases.
As a result I find it difficult understand the hype.
Is that the reality of today?
Personally I too feel that distributed computing is overkill for most 'big' data problems.
But even if you "collect everything and sort it out later", in my own personal experience and in what I've read here on HN, you can go a long long way before you need to reach for the power tools. What most companies call "big data" is typically not that much (in quantity and in velocity). Most companies don't have tens or hundreds of terabytes of data. For example, I'm currently processing timeseries data in postgres using the timescaledb extension, which makes it perform very well. Still too early to state numbers, but its looking promising so far and if their claims are true, then I won't need anything else. We will see :)
And it really only makes sense to have permanent infrastructure for distributed computing if you're constantly using it. Like if you're constantly rebuilding an index of the internet. Which most people aren't.
For occasional reindexing jobs, I've personally had success with Kubernetes. Our customer facing services were deployed with it, so it was trivial to bump up the number of nodes, and then schedule a bunch of workers containers, then roll everything down when the job is complete. No need to learn the ins-and-outs of Hadoop.
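The "schedule a bunch of worker containers, then roll everything down" pattern maps onto a Kubernetes Job; a hedged sketch, with the image name and counts as placeholders:

```yaml
# Illustrative batch reindexing Job; image and counts are made up.
apiVersion: batch/v1
kind: Job
metadata:
  name: reindex
spec:
  parallelism: 20          # run up to 20 worker pods at once
  completions: 200         # total work items to finish
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: worker
        image: registry.example.com/reindex-worker:latest
```

When the Job completes, the pods are gone and the cluster shrinks back down; no standing Hadoop infrastructure needed.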
In telcos you have network telemetry data. In supermarkets and retail you have purchase data and often credit card gateway data. In banking and finance you obviously have transaction data.
And with Kubernetes you still need a compute framework -- like, I don't know, the industry-standard Hadoop/Spark framework.
Note I'm not saying all. Just 99%, like the parents comment referred to. That leftover 1% are the companies you can name off the top of your head. ExxonMobil, Target, Chase, Visa, etc.
If you have data that is actually big (= doesn't fit onto a machine), then Hadoop is a reasonable candidate. Otherwise you are fine with a lot simpler tools.
This has probably been true for quite some time, and companies are just slowly realizing now that their relevant data isn't actually that big. On the other hand, computing power and storage have still grown in the last years, and less resource-intensive ETL has become more accessible, so the bar for "big" data has been raised quite a bit.
The idea was interesting, but it didn't quite work out. At some point he left the company and we had to do something about the pipeline, as it was crashing most of the time. We did some calculations, and figured that with the right approach, a good old MySQL+PHP solution would do the trick. And it did.
Having switched jobs myself in the meantime, I'm happy knowing the system is still running, and finding people to maintain it is relatively easy.
It's always easy to pass judgement at technology choices but in my experience they are often made with the best intentions based on requirements that not everyone is aware of.
> Implemented big data / real-time Hadoop streaming ETL service processing billions of requests.
I've become very skeptical of anyone who puts a combination of buzzwords and pseudo-numbers in their resume.
We don't know about the specifications because we don't work there and since the OP said he wasn't there he might not know either. And the fact is that requirements regularly change over time.
The point is that it's really easy to judge when you aren't there and are privy to all the facts.
And you're damn right I'm going to be judgemental of someone who not only promised the moon, but abandoned the work when the rocket exploded on the pad and someone else had to clean up.
I know this is circumstantial, but on average there were 2 developers. The company needed something to just get the bills out. It also needed to be correct, stable, easily expandable for new datasets (requiring all kinds of conversions), testable from unit to e2e, integrate with other APIs to retrieve metadata, and .. I forgot a few, probably.
Hitting a performance or stability limit with MySQL with an unsophisticated schema/architecture does not mean you have big data. But that's the scenario that's common.
Even Google, the king of big data, will sample your hits on Google Analytics if your site gets too much traffic.
I guess tell that to the 10s of companies doing PaaS. Or another 10s doing app monitoring / logging. They're successful companies, far from just "breaking even".
Preconfigured clusters, integration with your existing AWS deployments. All that sort of jazz.
Speaking of Hadoop: my vehicle is parked outside one of the HQs of another top-10 Hadoop startup. It's one of the most expensive, nearly empty buildings in the highest-rent areas of the Valley. (Money flushing sound here.)
Fun fact: one of the enterprise Hadoop CTOs is a brony.
Be a pretty awful world if everyone took your advice.
Whenever a new big piece of tech comes out, most of the detractors seem to either a) find a use case that it isn't fit for, or b) try to use it without bothering to learn how, and then proceed to say 'see, it's not all it's cracked up to be'.