Show HN: Posixcube, a shell script automation framework alternative to Ansible (github.com)
53 points by schizoidboy on Jan 12, 2017 | 53 comments



A similar tool I switched to from ansible for my personal project is https://pressly.github.io/sup. Ansible felt heavy and restrictive for my small project (lots of abstractions to learn and APIs on top of that for extensibility if those abstractions don't meet your needs).

sup on the other hand has a very small surface area of things needed to learn if you're familiar with shell scripting. I would definitely recommend it for smaller projects.


I like a lot of things about this, starting with the website describing it, the screencast, and the easy-to-follow instructions. It doesn't cover everything I need (e.g. handling encrypted variable substitution in file templates), but thanks for sharing.


Looks like exactly what I need, thanks for sharing!


This looks great, thanks for that!


What I needed too. Thanks.


Hrm, personally I'd prefer a better language than Bash.

I was annoyed by all the structure Ansible, Chef, etc. impose, so now I use Fabric [1]. It's for a few personal servers, and for that it works just fine.

[1] http://www.fabfile.org/


I agree, bash is not that enjoyable. Not only that, but the standard Unix binaries tend to have inconsistent arguments and in general seem like a clusterf--k to me. I know, it's my knowledge and my fault, but that doesn't mean I have to think they're somehow elegant in UX; it's terrible, imo.

With that said, I still prefer tools like Sup that just stick to bash & Unix. As much as I hate bash & Unix, not using them requires learning yet another API that will probably end up with either missing edge cases or complexity due to edge cases. Unix is the way it is because the "problem" is complex, not because the devs involved were incompetent. An abstraction on top of Unix is just another API, for me at least.


I agree that shell scripting is a very limited language, especially when trying to keep scripts POSIX-compliant.


I believe deployments should be described as declarative data, not as imperative code.

Declarative: it can first be tested, verified, and dry-run, and it can be implemented in any language. Salt (my preferred CM) has YAML data files that can be analyzed and verified long before any system command runs, for example.

Imperative: you don't know what it does unless you run it. You don't even know whether it will execute without a syntax error, especially in a dynamic language like bash (posixcube) or Ruby (Chef, et al.).


Let's say I want to whitelist a set of IP addresses in the firewall configuration. Let's assume I know the commands to do this using firewall-cmd. Below is how I would do this with posixcube. Can you give an example with a declarative system of your choice for comparison?

  # Loop through the list of servers passed in
  for cubevar_app_server in ${cubevar_app_servers}; do
    # Get the IP address of the server
    cubevar_app_server_ip="$(dig +short ${cubevar_app_server} || cube_check_return)" || cube_check_return
    
    # Skip ourselves by comparing to our hostname
    if [ "${cubevar_app_server}" != "$(cube_hostname)" ]; then
      if ! firewall-cmd -q --zone=trusted --query-source=${cubevar_app_server_ip} ; then
        cube_echo "Adding ${cubevar_app_server}'s IP ${cubevar_app_server_ip} to trusted whitelist"
        firewall-cmd --zone=trusted --add-source=${cubevar_app_server_ip} || cube_check_return
        firewall-cmd --permanent --zone=trusted --add-source=${cubevar_app_server_ip} || cube_check_return
      fi
    fi
  done
This would be run as follows (where the -o parameter is expanded from ~/.ssh/known_hosts):

  posixcube.sh -h $TARGETSERVER -o cubevar_app_servers=*.mydomain.com -c firewall_whitelist


Generally the IP addresses would be selected by host groups (e.g. all) - and if firewall-cmd was the canonical way to add rules, that's probably what e.g. Ansible would run on each platform.

An argument might be made that this would live better in a (templated or not) config file in version control - and then the action becomes: fill in template, push config, (re)load config.

In this case, for Ansible, you'd probably try to use command (there's also shell, but simple is better - when possible): http://docs.ansible.com/ansible/command_module.html


I want to see a fully fleshed out example for comparison. When I find some time, I plan to take this use case as an example, implement it in all major automation frameworks, and publish the comparison.


Right, I suppose for Ansible the idiomatic approach is to write/use a plugin - like:

http://docs.ansible.com/ansible/firewalld_module.html

Expanding the white-list of ips is a matter of config/templates.

Fwiw, I do think Ansible tends to get a bit complicated, because plugins tend to become opaque and complicated. On the other hand, some simple special variables like "all_other_servers", to avoid manually filtering out "self" procedurally all the time, would probably be a good idea.

I'm not sure if I'll be able to spare the time - but I'll at least watch the github repository - perhaps I can try and help flesh out some alternative examples if you file an issue?

[ed: perhaps such things should go here?

https://github.com/myplaceonline/posixcube/issues/4 ]


Agreed. I referenced this discussion in that issue and hopefully soon I'll get around to building it (would be wonderful if you could help). Thanks for the pointers; they'll save me some time learning Ansible.


The impetus for this project was that I wanted to move away from Chef because it was costing me an extra administrative server per month. I wanted to try something new, so I didn't bother looking into chef-solo. I wanted something agent-less, and it seemed like Ansible was the latest alternative, but when I looked at its YAML, I felt fatigue at learning yet another domain-specific language. I thought, "Hey, why not shell scripts?" (insert jwz RegEx joke here). I checked search engines and Stack Overflow, but I couldn't find much except for FSS, which didn't meet my requirements, and I had always wanted to learn more about shell scripting, so I created my own framework called posixcube.sh (open source, MIT license): https://github.com/myplaceonline/posixcube. There were a few requirements that I had from my Chef cookbooks:

  * Idempotent file templating with variable substitution
  * Plain-text and encrypted variable files
  * Logical packaging of recipes
  * Roles and a way to check for roles in scripts
  * Repeatable sets of actions for certain types of servers
So beyond the basics in posixcube.sh of uploading itself to the remote host and providing a consistent API, the above requirements were solved with: cube_set_file_contents, gpg for encryption, "cubes" for logical packaging (although loose scripts and straight commands are also supported), -r ROLE, and cubespecs.ini, respectively.

After having migrated an entire Ruby on Rails architecture (https://github.com/myplaceonline/myplaceonline_posixcubes) to posixcube, I thought it went very well, although I certainly see some of the downsides of shell scripting such as checking for error codes in a pipeline, strange HEREDOC interpolation, etc. I also see some of the benefits of complete idempotency, but I didn't find the additional procedural checks to emulate idempotency (outside of file templates) that difficult. For the simplest cube example from the previous link which configures an NFS server, see https://github.com/myplaceonline/myplaceonline_posixcubes/bl... and for a more complex nginx+Rails cube, see https://github.com/myplaceonline/myplaceonline_posixcubes/bl...

You can test it as simply as:

  git clone https://github.com/myplaceonline/posixcube
  cd posixcube
  ./posixcube.sh -h $SERVER "cube_echo Hello World"
For more complex examples, see the usage. I'm open to any feedback and any philosophical debate on this approach versus Ansible, Docker, etc.


I've been developing something eerily similar myself, which amounts to 41+148 lines of (two) minimalist shellcheck'd bash scripts, providing me with a dead simple way to push and apply a unit (or group of units).

It just pushes some dumb shell scripts ("units") that we write to be idempotent. "Dependency" order is handled by a flat file (a "group") that references unit scripts or other group files, running each one in sequence (silently with logs, or verbosely). Each unit runs in its own shell (so it can't leak env vars to the next one) that sources a very small library of helper shell functions (currently three functions that tally up to 41 lines atop the 189 mentioned above).
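To make that concrete, here is a rough sketch of what such a runner could look like (the file layout, the lib.sh name, and the stop-on-first-failure policy are made up for illustration; this is not the actual tool):

  #!/bin/sh
  # A "group" is a flat file listing unit scripts (or other groups); each unit
  # runs in its own shell so environment variables can't leak to the next one.
  run_group() {
    while IFS= read -r entry; do
      [ -z "$entry" ] && continue
      if [ -f "groups/$entry" ]; then
        ( run_group "groups/$entry" ) || exit 1      # nested group, isolated
      else
        # fresh shell per unit, sourcing the small helper library first
        sh -c ". ./lib.sh && . \"\$1\"" unit "units/$entry" || exit 1
      fi
    done < "$1"
  }
  run_group "groups/$1"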

The whole thing is so small, simple and approachable that it just flies in the face of puppet/chef/ansible in terms of productivity at our scale and for our problem domain.

I didn't even consider opening it given how small it is but I guess it could turn out to be useful to some nonetheless.


Please share if you can, I'd like to take a look.


I can't criticize someone rolling their own foo for learning's sake (or just for the fun of it), but I think if you'd given Ansible a try, you would have been happy with it.


Ansible has its strengths, but there's something to be said for the simplicity (and auditability) of shell scripts.

And frankly, after having some very sensible patches rejected (this was a few years ago, before the acquisition), I got fatigued of the entire project.


Ansible is nice, but it has some annoying quirks:

- YAML quoting.

- Quasi-Python expressions from Jinja 2.

- Which fields are expressions and which ones are strings is not always clear.


Agreed, although I will say the templating has improved over the past year to be more consistent and full-featured compared to Jinja 2 proper, and they are now enforcing interpolation instead of inferring expressions (e.g. if foo is an array you have to write `with_items: "{{ foo }}"`; `with_items: foo` was deprecated, and is an error as of Ansible 2.0 IIRC). The quoting issue is still there, although once you learn the two or three rules it really isn't a day-to-day issue, and Ansible tries to warn you if and when it sees something problematic.


I spent hours trying to get Ansible installed properly on Arch Linux. It just wasn't meant to happen. So I wrote a shell script. The thing is, it was easy to get it working on OS X.


I use Ansible on Arch Linux to manage my CoreOS and Raspbian-based Kubernetes cluster. It was just a `pacman -S ansible` away and installed super easily. What were the problems you ran into?


Yeah, I'm sure Ansible is good and I'm sure I would've liked it as much, or more, than Chef.


Have you considered integrating it with a cloud provider account? So that $SERVER in your example could be a DO droplet name, or a hostname from "google cloud compute instances list" or "aws ec2 describe-instances" commands.


Thanks, great idea. There's already optional Bash programmable tab auto-completion integration which auto-completes hosts from ~/.ssh/known_hosts (the auto-completion is installed using the `-b` option). I've created GitHub issues for integrating DO, Google Cloud, and AWS [1, 2, 3]. I'll start work with DO first since that's what I'm using now.

[1] https://github.com/myplaceonline/posixcube/issues/1 [2] https://github.com/myplaceonline/posixcube/issues/2 [3] https://github.com/myplaceonline/posixcube/issues/3


On one hand, if you're managing a few servers with shell scripts, you'll invariably end up with a framework of sorts. On the other hand, after a few years playing with bash and POSIX shell -- I'm fairly certain I'd prefer bash to POSIX shell -- it seems rather unlikely you can easily use the same set of scripts for BSD, OpenSolaris, and various Linux distributions. If your target is Linux, pretty much all distributions ship with bash in the base install (just remember to use the proper shebang of #!/usr/bin/env bash or equivalent).

And shell script is a fickle language -- I'd at least consider using "trap" and/or "set -e", "set -u" -- maybe "set -x" rather than "echo 'what I am about to do'; what I am about to do". See e.g.: http://redsymbol.net/articles/unofficial-bash-strict-mode/
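For reference, a minimal sketch combining that article's prelude with the trap/tracing ideas above (bash-specific, so it wouldn't apply to a strictly POSIX script like posixcube):

  #!/usr/bin/env bash
  # "Unofficial strict mode": fail fast instead of silently carrying on.
  set -euo pipefail      # exit on errors, unset variables, and pipeline failures
  IFS=$'\n\t'            # safer word splitting
  trap 'echo "error at line $LINENO (exit $?)" >&2' ERR
  set -x                 # trace each command rather than echoing it by hand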

I agree that ansible is far from perfect - but I'm not convinced posixcube is a great alternative right now.

From a brief look at the code, it does look nicely structured and commented, though!

I've played a bit with the idea of doing something similar (mostly because I'm not convinced all the complexity present in Ansible is necessary for my use case -- which appears to be similar to your use case -- a few small, simple server configs).

I've seen people here on HN recommending some makefiles for deployment (along with simple shell scripts) -- and I've played a bit with using m4 for templates rather than here-documents -- and it looks like that might work fine. Does mean your base-installs/bootstraps/images will need m4 (or "compile" the templates on the workstation you run your automation from).

But right now, I feel GNU Guix is the right approach: have a small, minimal VM/container host (right now I'm playing with Ubuntu and LXD/LXC - Xen and KVM also work nicely), and have all images be managed by something like Guix.

That reduces the need for additional management and automation to the install of hosts - possibly PXE installs of Guix if you need/want to run some hosts on bare metal -- and should simplify management by having system-level rollback etc.


> On the other hand, after a few years playing with bash and POSIX shell -- I'm fairly certain I'd prefer bash to POSIX shell

What are some bashisms that you find particularly useful? I wonder if I can just reproduce them in POSIX. So far, I haven't found POSIX too bad, although I don't have much experience making huge and complicated shell scripts. posixcube does support some optional Bashisms already like tab auto-completion and `caller`.

> And shell script is a fickle language -- I'd at least consider using "trap" and/or "set -e"

I didn't like `set -e` because it disallows using "benign" error codes. For example, one of the most commonly used APIs in posixcube is cube_set_file_contents. You pass the file you want to update and the source file. If the file is updated, it returns 0; otherwise, it returns 1. This is commonly used like this:

  if cube_set_file_contents "/etc/ntp.conf" "templates/ntp.conf"; then
    cube_service restart ntpd
  fi
I hadn't considered `set -u` and it's part of POSIX. My initial reaction is that it's a nice feature, so thanks for pointing it out, and I might set it.

pipefail is a Bashism, and I'm torn on this. Handling errors in a pipeline is one of the worst parts of POSIX shell scripts. However, using something like pipefail means exiting with an error without a stack. I think it's always useful to have a stack to quickly jump to where the error is in which script. So I still prefer this guideline:

  There are non-standard mechanisms like pipefail and PIPESTATUS, but the standardized
  approach is to run each command separately and check the status. For example:
  cube_app_result1="$(command1 || cube_check_return)" || cube_check_return
  cube_app_result2="$(printf '%s' "${cube_app_result1}" | command2 || cube_check_return)" || cube_check_return
> I agree that ansible is far from perfect - but I'm not convinced posixcube is a great alternative right now.

Please see my reply to another comment in this thread, asking how a simple firewall whitelist configuration would be done. I think this would be a useful example to debate around.

> From a brief look at the code, it does look nicely structured and commented, though!

That's very nice to hear, thanks! This was my first time doing anything hardcore in shell scripting.

> I've seen people here on HN recommending some makefiles for deployment (along with simple shell scripts) -- and I've played a bit with using m4 for templates rather than here-documents -- and it looks like that might work fine. Does mean your base-installs/bootstraps/images will need m4 (or "compile" the templates on the workstation you run your automation from).

I really like that posixcube has 0 requirements, but I'm not philosophically opposed to requirements, so I'm certainly curious what a make-based system would look like. Does make have imperative programming like shell scripts?

> But right now, I feel GNU guix is the right approach: have a small, minimal, vm/container host (right now I'm playing with Ubuntu and LXD/LXC - Xen and KVM also works nicely), and have all images be managed by something like guix.

I'm not aware of that, I'll check it out.

Thanks for the comments.


> What are some bashisms that you find particularly useful? I wonder if I can just reproduce them in POSIX.

Mostly [[ vs old [/test, advanced parameter expansion/substitution, arrays and maybe local variables.
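A quick, illustrative sketch contrasting a few of these with rough POSIX workarounds (the bash-only lines won't run under plain sh):

  # bash: [[ ]] supports pattern matching; POSIX uses case instead
  name="backup-2017.tar.gz"
  [[ "$name" == *.tar.gz ]] && echo "tarball"          # bash
  case "$name" in *.tar.gz) echo "tarball" ;; esac     # POSIX
  # bash: substring replacement in parameter expansion; POSIX shells out to sed
  echo "${name/backup/archive}"                        # bash only
  printf '%s\n' "$name" | sed 's/backup/archive/'      # POSIX
  # bash: real arrays; a common POSIX fallback is "$@" via set --
  set -- web1 web2 db1
  for s in "$@"; do echo "$s"; done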

It's been rather long since I attempted to get some dark magic working in posix shell - so I'm not entirely sure.

Either way, any proper scripting language will be more ergonomic to work in once you've reached beyond the trivial (which posixcube already has).

There are simply too many anti-patterns to check for - too easy to miss quoting etc.

Python isn't great for (POSIX) system automation - but it's pretty good. And now I can trust all my systems to have Python 3. So I find it a bit hard to justify using shell beyond simple "command1, command2, command3"-type automation.

That said, it can be great fun to weave pipes and inputs in shell when you get it working. But as you've pointed out, error handling with complex pipes isn't very good - and as always, it's better to keep things simple when possible.


Yeah, I'll add some of the most common parameter expansion/substitution, string functions, array functions, etc. as APIs.

I had a thought yesterday for handling pipeline failures by using set -e in a sub-shell. I still need to test this, but something like:

  (set -e; command1 | grep foo | grep bar | sed ...) || cube_check_return


> I wanted something agent-less [...]

Sorry to break it to you, but you have an agent anyway - except it's the same fragile channel used for debugging, administrative access, and running scripts, so once you need to configure it, you can break all of those things at the same time.

Oh, and you just dropped proper configuration management. Now you need to add a task queue (another thing that can break, overflow, and whatnot) to re-send configuration commands when some server is down, which would not be a problem in the first place with an appropriate system architecture.


> just dropped proper configuration management.

Unless I'm not understanding what you mean by configuration management, I don't see what prevents the "agent-less" setup from having this. Store the config in VCS, or if it has secrets, some ACL'd storage system that's accessible by your laptop/whatever is actually running the deploy.

Agent-less here, at least to me, means not having a server that the hosts phone back to, and instead having the deploying machine call out to the servers with SSH. It's just connecting in the opposite direction, but allows you to connect from wherever is most convenient. SSH is built for running commands, and a lot of the existing stuff seems to just reinvent that.

> Now you need to add a task queue (another thing that can break and overflow and what not) to re-send configuration commands when some server is down,

If a server that I expect to be up is down, I'd much rather the task fail outright, as I clearly expected the machine to be up (by attempting to have it execute something) when it is not. Time for a human to re-evaluate the situation, not retry.

It also isn't clear to me that all deployment tasks are idempotent, and safe to retry. Most that I've dealt with are, but only most, not all.


> Agent-less here, at least to me, means not having a server that the hosts phone back to, and instead having the deploying machine call out to the servers with SSH.

So instead of a single place where your machines get configs from, you have potentially multiple machines that appear and disappear in an ad-hoc manner. Such a setup is harder to maintain (and harder to defend security-wise) than one with a designated master.

> SSH is built for running commands, and a lot of the existing stuff seems to just reinvent that.

You're mistaken. SSH was not built for running commands; it was built for providing interactive shell access -- and for that it works well.

Running administrative commands through SSH is just silly. With the tooling we have, there's not even proper error reporting (how do you tell that a command reached its target server but failed there? you can use exit-code heuristics at best).

>> Now you need to add a task queue (another thing that can break and overflow and what not) to re-send configuration commands when some server is down,

> If a server is oddly down that I expect up,

And if it was planned downtime? "Tough luck"?

> I'd much rather the task failed outright, as I've clearly expected the machine to be up (by attempting to have it execute something) when it is not.

You'd much rather have the server monitored and the other systems resilient, instead of them failing every now and then for random reasons and needing special additional actions to recover them.


> Running administrative commands through SSH is just silly. With the tooling we have there's not even proper error reporting (how to tell that a command reached its target server but failed there? you can use heuristics of exit code at best).

The exit code of the remote process is exactly what you should use; I'm not seeing what you think is heuristic about it. Either you get back that the remote process failed, in which case you know it failed on the remote; or you get back success, in which case you know it succeeded; or you don't get anything back at all, in which case you just don't know. This isn't particular to exit codes per se; any network message to a remote machine is subject to the network or the remote host possibly dying between sending it and receiving confirmation of whether it was received and had any effect.

If you're using SSH-the-process, it's going to mux things like network failures into that error code. You can use a library (in a language a bit more robust than Bash) so that you know you're looking at the exit code of the remote process. If you must use SSH-the-process, you can always output the exit code of the remote process you're interested in onto stderr, in a manner that you can parse back out on the receiving end.

> And if it was planned downtime? "Tough luck"?

Again, yes, though I don't see the "Tough Luck" about it; if I'm attempting to run some unknown and arbitrary command on a machine that was planned to be down, then yes, I need to step back and realize that as a human, and make sure that I'm doing the right thing. (Again, it depends on the command; what would be the effects of re-running it, and can you programmatically guarantee that re-running is a safe operation?)

> You'd much rather the server be monitored, with other systems being resilient than failing every now and then for random reasons and needing special additional actions to be taken to recover them.

Well, yes, I would rather the server be monitored. But monitoring is significantly orthogonal to the system I use to run administrative commands on the machine.

Again, it's not so much about being resilient as it is about not taking a bad situation and making it worse. I've run a lot of non-idempotent administrative commands across tons of machines, and if any failed anywhere, I would need to intervene. (I also usually tested heavily beforehand, to make sure there was a decent chance of not needing to intervene later.) But that still, to me, implies that deciding whether a given command should be retried until it definitely runs belongs at a layer higher than the tooling used to distribute commands out to host machines, and in the general case, could very well require a human.


> The exit code of the remote process is exactly what you should use; I'm not seeing what you think is heuristic about it.

Lucky you if your SSH client has never failed to connect. How do you think it reports a network failure?

> This isn't particular to exit codes per-se, any network message to a remote machine is going to be subject to the network or remote host possibly dying between sending it and receiving confirmation of whether or not it was received and had any effect.

You see, there are tools that allow you to easily tell between network failure and remote command failure. RPC protocols are an example.

> If you're using SSH-the-process, it's going to mux things like network failures into that error code. You can either use a library (in a language a bit more robust than Bash) so that you know you're looking at the exit code of the remote process.

Now let's get practical: find me such a library.

>> And if it was planned downtime? "Tough luck"?

> [...] if I'm attempting to run some unknown and arbitrary command on a machine that was planned to be down, then yes, I need to step back and realize that as a human, and make sure that I'm doing the right thing.

We were talking about configuring servers. You're basically telling me that I would have to pay for "agentlessness" by tracking the state of each and every machine I might want to reconfigure, and remembering to check whether a machine was booted so I can re-issue a reconfiguration command.

> (Again, it depends on the command; what would be the effects of re-running it, and can you programmatically guarantee that re-running is a safe operation?)

Sorry, I'm lost here: what re-running? The command didn't get the chance to run in the first place. I don't know where you got the idea that I propose running the command again if it got through to the remote server and failed there.


> Now let's get practical: find me such a library.

Certainly. I'm most familiar with Python, and for that, both Paramiko and asyncssh support returning the exit status of a command, distinct from any network failure.

As I mentioned in the above reply, even with command-line ssh (and any other library, really), if you're creative about how you structure the output on stderr, you can encode the exit status there to be able to differentiate it from ssh returning other errors. E.g., if you pipe the entirety of stderr through, say, base64 and then append at the end a newline and the exit status, you can parse out the exit status on the receiving end. (This is admittedly quite a hack: using a library in Python with direct access to the data is much nicer. The ssh binary is mostly a porcelain interface to the SSH protocol; libraries are a better tool for building on top of.)
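A simplified sketch of that idea, using a marker line instead of base64-encoding stderr (the host and command names are hypothetical):

  # Have the remote side report its own exit status on a marker line, so it can
  # be told apart from ssh's own exit code (255 on connection-level failures).
  out="$(ssh user@host 'mycommand 2>&1; printf "REMOTE_EXIT:%s\n" "$?"')"
  ssh_status=$?
  remote_status="$(printf '%s\n' "$out" | sed -n 's/^REMOTE_EXIT:\([0-9][0-9]*\)$/\1/p' | tail -n 1)"
  if [ -n "$remote_status" ]; then
    echo "remote command exited with status $remote_status"
  elif [ "$ssh_status" -eq 255 ]; then
    echo "ssh-level failure (network, auth, etc.); remote status unknown" >&2
  fi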

> with tracking state of each and every machine I would possibly want to reconfigure, and remembering to check if the machine was booted to re-issue a reconfiguration command.

Perhaps this is just a difference in our production environments. We strive very hard to have all our machines online at all times. (Excepting short outages for rebooting for patches, etc., of course.) We use VMs, so there's not much point to having it be shut down.

> Sorry, I'm lost here: what re-running?

My larger point was about non-idempotent commands. If your infrastructure retries commands, are you just very careful not to have non-idempotent stuff? Generally, these things are either "run at most once" or "run at least once"; running exactly once is (in general) impossible, as the command might finish successfully but the host machine crash prior to recording that success. Again, if you can structure the command to be idempotent, you can work around this, but in my experience there always seems to be some patch or database upgrade for which this can't be accomplished.


>> Now let's get practical: find me such a library.

> Certainly. I'm most familiar with Python, and for that, both Paramiko and asyncssh support returning the exit status of a command, distinct from any network failure.

Oh, that's something new. From what I heard, Paramiko used `ssh' under the hood, but apparently I was wrong.

> ([...] The ssh binary is mostly a porcelain interface to the SSH protocol; libraries are a better tool for building on top of.)

There's still a big problem on the client side itself. SSHv2 is a complicated protocol, so client libraries' APIs are not pretty or force you to use OpenSSH-like home (or both, like Paramiko does).

>> with tracking state of each and every machine I would possibly want to reconfigure, and remembering to check if the machine was booted to re-issue a reconfiguration command.

> Perhaps this is just a difference in our production environments. We strive very hard to have all our machines online at all times.

You must then have a very homogeneous environment with a small team. Sounds like a startup of sorts. Anything looser (e.g. more than one team working on the servers, or a split between who installs the machines in racks and their OSes on disk and who manages the OSes), and you won't have the luxury of keeping everything up and running all the time.

>> Sorry, I'm lost here: what re-running?

> My larger point was about non idempotent commands. If your infrastructure retries commands, are you just very careful to not have non-idempotent stuff?

And what does that have to do with re-sending commands when they didn't reach their target machine?

> Again, if you can structure the command to be idempotent, you can work around this, but in my experience there always seems to be some patch or database upgrade that this can't be accomplished for.

Uhm... what? Where did you get the idea that upgrading a database is managing configuration? It's a deployment operation. A different category.


> And what does it have anything to do with re-sending the commands when they didn't reach their target machine?

It has to do with whether it's at all safe to run the command twice. If you don't hear back from a machine that your command ran, you don't know, in general, whether the command reached the machine or not. If you decide not to re-run the command, it may be that it never made it, and you never run the command. If you decide to re-run the command, it may have reached the machine, and you've now run the command twice. Consider:

  1. You send command $X to a remote machine.
  2. The remote receives and executes $X successfully.
  3. Just prior to sending back that $X was successful, the remote dies.
From the point of view of the deployment machine, you never hear back from the remote. You don't even need two machines to get into this state: even with a single local command, if you run $X and the machine crashes just as $X is about to return whether it succeeded or failed, you're out of luck.

For specific commands, such as ones that log their progress and can recover from crashes, you can automatically determine the correct procedure after a crash. But not every arbitrary Unix administration command falls into that bucket, especially given the weird and wacky things that need to be done to keep production machines alive.

> Uhm... What? Where did you get the idea that upgrading database is managing configuration? It's a deployment operation. A different category.

We used our deployment infra for all sorts of fun stuff.


Again, thanks for the objections, they're making me think hard about the design choices.

> Lucky you if your SSH client never failed to connect. How do you think it reports a network failure?

Currently posixcube handles this by checking the return code and reporting the error.

> You see, there are tools that allow you to easily tell between network failure and remote command failure. RPC protocols are an example.

ssh does have some basic differentiation: "ssh exits with the exit status of the remote command or with 255 if an error occurred." Although, of course, 255 might be returned by a remote command (not sure how likely that is), so it's still ambiguous. I guess you're saying that if it's an SSH error, then that's pretty safe to retry and it would be a nice feature, and I agree so I've opened this issue: https://github.com/myplaceonline/posixcube/issues/6

> Now let's get practical: find me such a library.

I guess it would just be a matter of, first, checking for 255, and next, checking the output of the command to see if it's an SSH error for sure?

> We were talking about configuring servers. You basically tell me that I would have to pay for "agentlessness" with tracking state of each and every machine I would possibly want to reconfigure, and remembering to check if the machine was booted to re-issue a reconfiguration command.

So you bring up a good point. On DigitalOcean, for example, their Fedora images use extlinux for booting, and I can easily update the Linux command line (e.g. crashkernel), reboot, and it picks it up. Their Ubuntu images use GRUB, and update-grub works fine, but for some reason it requires a hard shutdown instead of a reboot to pick up the change. The way I handle this now is here: https://github.com/myplaceonline/myplaceonline_posixcubes/bl...

We echo this instruction to the user. This prints a red-text error message and stops execution. This seems even better than agent-based execution because with an agent-based architecture, the agent is now gone (hard powered off); whereas this style of orchestration on the front end can use a script on top of posixcube.sh that waits a minute, remotely calls DO to start the droplet, and then continues execution (using the same script, since it's essentially idempotent). The only other option would be a centralized server that orchestrates all of this, which posixcube avoids.

> Sorry, I'm lost here: what re-running? The command didn't get the chance to run in the first place. I don't know where did you get the idea that I propose running the command again if it got through to the remote server and failed there.

See above, I'll need to try it, but I think the ssh error code for "command didn't get through" along with some simple heuristics on the command output /might/ work; I'll investigate in issue #6.


> ssh does have some basic differentiation: "ssh exits with the exit status of the remote command or with 255 if an error occurred." Although, of course, 255 might be returned by a remote command (not sure how likely that is), so it's still ambiguous.

If one uses SSH for running such scripts, then it is not unlikely that the script itself would have another call to `ssh' to an even-more-remote machine.

And then there is the case of not expecting that a command would ever exit with code 255. My trust in a system would be quite damaged if I happened to run into such a bug that a genuine remote error was mistaken for a network error. How can I, as the script writer, be sure that I never run into this situation?


I agree the ambiguous return code is a major limitation of SSH. I think I'll just avoid any retry logic and treat every failure the same: fail immediately and let the user decide how to retry. Since posixcube does most of its execution over a single SSH connection (packaging all commands/cubes into a cube_exec.sh which is executed remotely), I think the immediate need for automatic retry on failing to connect over SSH isn't large.


> but you have an agent anyway

By agent, I meant something narrower: a prerequisite that must be installed on a server before it works. In that sense, the agent here is sshd, but that can be presumed to always exist.

> so once you start to need to configure it, you can break all the things at the same time.

Do you mean if the script updates sshd configuration?

> Oh, and you just dropped proper configuration management. Now you need to add a task queue (another thing that can break and overflow and what not) to re-send configuration commands when some server is down

I'm not sure what you mean, can you please elaborate on a use case of a task queue?

Thanks for the comments!


> I'm not sure what you mean, can you please elaborate on a use case of a task queue?

Answering my own question by re-reading your comment: I'm not familiar with systems automatically retrying commands. What if there is some fatal error on the server? It seems like this is something that requires a human, not a task queue. Relatedly, if the system has the `caller` built-in, then when there's an error, a stack trace is dumped, including the line number of the failing cube, so someone can quickly jump to the failing line in the script.
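For anyone unfamiliar with `caller`, here's a rough sketch of the general idea (this is not posixcube's actual implementation):

  #!/usr/bin/env bash
  # Print one "line subroutine filename" entry per stack frame using `caller`.
  print_stack() {
    local frame=0
    while caller "$frame"; do
      frame=$((frame + 1))
    done
  }
  fail() {
    echo "error: $1" >&2
    print_stack >&2
    exit 1
  }
  configure_ntp() {
    cp templates/ntp.conf /etc/ntp.conf || fail "could not install ntp.conf"
  }
  configure_ntp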

With that said, a retry mechanism could be built-in if it's a big use case.


>> so once you start to need to configure it, you can break all the things at the same time.

> Do you mean if the script updates sshd configuration?

If you ask this question, it means you've probably never configured SSH in bigger environments, because you can't imagine how easy it is to accidentally cut off your own access. The user's public key disappears, the user's shell changes or becomes non-executable, the user doesn't get mentioned in AllowUsers and/or AllowGroups, the user's home becomes unreadable, and plenty of other failure modes.

>> I'm not sure what you mean, can you please elaborate on a use case of a task queue?

> Answering my own question by re-reading your comment: I'm not familiar with systems automatically re-trying commands. What if there is some fatal error on the server? It seems like this is something that requires a human, not a task queue.

You actually have provided me another important point: how do you tell that the command reached its destination server but failed there? This is a vastly different case from when a command didn't get through to the server and so can be safely re-sent. (And I was talking about the latter case.)

> With that said, a retry mechanism could be built-in if it's a big use case.

But now you have a queue, which needs to be managed and can also break. This is not an improvement. A setup with a dedicated agent that works in "pull" mode (instead of one that was not designed for this task but just happened to be already installed) doesn't have the problem that needs solving with a queue in the first place.


First, I just want to say that I appreciate this conversation and debate. It's making me think hard about certain concepts, so thank you.

> If you ask this question, it means you've probably never configured SSH in bigger environments, because you don't imagine how easy is to accidentally cut off your access.

Many moons ago, I was mucking with sshd configuration, and locked myself out. It was the first and only time I ever had to truly hack - my own system. The system had a Wordpress installation which, at least back then, allowed you to modify the PHP code through the web console. I knew I had gcc on the box, so I found a NULL-dereference root escalation exploit for my kernel version, jammed a bunch of C code into a php system call, and step by step, escalated, and then overwrote my ssh config (all through escaped C code in PHP!). It was fun. So I totally hear ya, updating sshd configuration can be disastrous, and it's certainly required in certain situations. Posixcube is certainly not ready for massive enterprise usage, but I don't see why it couldn't be. For this particular point, it seems that no matter what technology you're using, SSH can always get mucked up, and I see no inherent reason why shell scripts are worse.

> You actually have provided me another important point: how to tell that the command reached its destination server, but failed there? This is vastly different case than when a command didn't get through to the server, so it can be safely re-sent. (And I was talking about the latter case.)

The way this is handled now is that all cube_* APIs check all return codes, and if one is non-zero, they print the full stack and `exit $code`, which is passed back through SSH to the client and reported, and the script stops. As for the latter case of safely re-sending a command that didn't even get to the server, I don't think that's much of an issue here, because posixcube doesn't execute one command at a time (the overhead of re-establishing the SSH connection would be massive); it generates a shell script on the fly (cube_exec.sh) containing all the commands you want to execute, uploads that along with all of the cubes, and then executes cube_exec.sh on the other side.
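In generic terms (not posixcube's actual code), the single-connection pattern amounts to something like this, where $TARGETSERVER, the user, and command1/command2 are placeholders:

  # Bundle the whole batch into one script, upload it once, and run it over a
  # single SSH connection so one remote exit status covers the entire batch.
  printf '%s\n' 'set -u' 'command1' 'command2' > cube_exec.sh
  scp cube_exec.sh "root@${TARGETSERVER}:/tmp/cube_exec.sh" || exit 1
  ssh "root@${TARGETSERVER}" 'sh /tmp/cube_exec.sh'
  echo "remote batch exited with status $?"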

> But now you have a queue which needs to be managed and that also can break. This is not an improvement. Setup with a dedicated agent that works in "pull" mode (instead of one that was not designed for this task, but just happened to be already installed) doesn't have the problem that needs solving with queue in the first place.

Are you talking about a use case where an administrator is not monitoring a deployment (i.e. fire and forget)?


> Posixcube is certainly not ready for massive enterprise usage, but I don't see why it couldn't be.

Because its architecture doesn't fit the job. The same situation is with Ansible or Salt being used for maintaining configuration of servers.

> For this particular point, it seems that no matter what technology you're using, SSH can always get mucked up, and I see no inherent reason why shell scripts are worse.

I don't say shell scripts are worse. I say the "push" architecture over SSH is worse for configuring machines than an agent in "pull" mode. You could easily write an agent that works on a cron-like schedule, downloads commands to be run (verifying their origin with a digital signature), runs them, and reports the results back, all in shell.
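A bare-bones sketch of such a pull agent (the URLs, paths, and report endpoint are hypothetical; it assumes curl and gpg are available on the host):

  #!/bin/sh
  # Run from cron: fetch a signed command script, verify it, run it, report back.
  url="https://config.example.com/$(hostname)"
  workdir="/var/lib/pull-agent"
  mkdir -p "$workdir" && cd "$workdir" || exit 1
  curl -fsS -o run.sh "$url/run.sh" || exit 1
  curl -fsS -o run.sh.sig "$url/run.sh.sig" || exit 1
  gpg --verify run.sh.sig run.sh || exit 1    # verify origin before executing
  sh run.sh > report.log 2>&1
  status=$?
  curl -fsS -X POST --data-binary @report.log "$url/report?status=$status"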

> Are you talking about a use case where an administrator is not monitoring a deployment (i.e. fire and forget)?

No, I'm not talking about deployment. Deployment is a one-off operation (even if something similar is repeated in the future). I'm talking about managing configuration, which is what Chef, Puppet, and CFEngine are for. If you were trying to use Chef for deployment, then no wonder you had a terrible experience.


> Because its architecture doesn't fit the job. The same situation is with Ansible or Salt being used for maintaining configuration of servers.

> I don't say shell scripts are worse. I say the "push" architecture is worse for configuring machines than with an agent in "pull" mode. You could easily write an agent that works on a cron-like schedule, downloads commands to be run, (verify their origin with a digital signature,) runs them, and reports the results back, all in shell.

I'm still not completely understanding, so let me see if I can rephrase your objection and please correct me if I'm getting it wrong:

A "pull" architecture for configuration management is better than a "push" architecture because updating configuration shouldn't require synchronous human interaction.

If so, then one example of a counterargument is needing to reload/restart a service after updating its configuration file. What if there's an error reloading/restarting the service? Does the agent then email the administrator? And what if the emails fail?


> A "pull" architecture for configuration management is better than a "push" architecture because updating configuration shouldn't require synchronous human interaction.

One may put it this way as one of the reasons, yes.

> If so, then one example of a counterargument is needing to reload/restart a service after updating its configuration file. What if there's an error reloading/restarting the service? Does the agent then email the administrator? And what if the emails fail?

You're on a good track here. There are several things not quite there yet, though.

First, e-mail reports don't scale. For a dozen servers the volume is still readable if every report fits on a single screen, but a move in either direction (report length or number of servers) makes the situation unsustainable.

Second, all of the operations should be reported, not just the failures, as you want to be able to see what was done, when, and in what order. Usually it's not interesting, as most logs are, but there are cases when you need this information, and you learn about the need only after the fact. Also, there are abnormalities that are reported as successes (e.g. a config that something else keeps replacing, so it has to be updated again on every scheduled execution).

Third, the operation logs should be structured. Plain text doesn't cut it, as you can't process them with a machine in a reliable and convenient way.

And fourth, you were right when you asked what happens if the report transport breaks down. This is covered by having reports from each execution; a simple automaton (that either queries execution reports from their store or tracks them as they come in) can easily tell that a server has gone missing. Such an automaton would be part of the report receiver or report store.

By the way, reading execution reports from a push tool doesn't scale, for the same reason as e-mail: too much manual reading. And if the tool stops on the first error it encounters, its use will be annoying in the long run (how do you recover and resume operations?)


Thanks, I understand your perspective much better.

Regarding the last point, I've always just re-run my last command. If the commands are broken up into separate independent scripts (e.g. install ElasticSearch, install Kibana, etc.), then I might only re-run the script (-c) that was failing to ensure that works, before re-running the whole spec (-z).


So what happened to Perl? Back in the day we used to do a lot of sysadmin automation using Perl. It was pretty awesome. Do sysadmins still use it?


Perl would play nicely with posixcube. Just do `cube_package install perl`, and then do with it as you wish. Posixcube is more about handling the things outside of Perl, Python, Go, etc., like distribution, installation, and so on.


Thanks for sharing. I needed something with a bit more automation for my side project. I used small scripts, but I never had a chance to automate more than that.


You're welcome! Hope you might find this useful, and glad to hear feedback.



