I see weird problems of the sort "It did work before I went on my lunch break" on a fairly regular basis. I would often like to go down the rabbit hole and explore these problems in depth, but if I did, I would hardly get any work done. The frequency at which our users run into these problems is just too high.
It is so frustrating to restart a computer, or maybe re-install it, and see the problem disappear, because now all hope of understanding what caused the problem in the first place is gone. And problems that disappear for no good reason have a tendency to return for no good reason, most often on a Friday afternoon, just when you're about to call it a day. ;-/
In many years of doing this type of work, I've noticed this fear of your neck being on the line is a myth. Heads rarely roll merely because something goes wrong.
I've also noticed almost every sysadmin believes that myth, and it can make us very difficult to relate to or even work with. Ironically, that's a problem that may actually cause you to lose your job.
Usually you can trace the insecurity to mismatched assumptions, unhandled edge cases or a failure to consider global contexts (like where you accidentally turn one function into a decryption oracle for another).
A problem shared is a problem halved. In this case every web developer needs to have a network of people to collaborate with in order to diagnose the problem.
Open-source software has bugs, but it gives you the ability to look into the issues directly, allowing you to figure things out if you want.
The cool thing about open source is you can always absolve yourself of responsibility and hire a vendor, a maintainer, or others to help fix the problem if there is one, which means you have N potential solutions to a problem, compared to just 1 with a closed-source vendor.
With DSC, I am increasingly excited by the idea that end-user computers can also shift to immutable, or approaching what we call immutable, infrastructure, an idea that carries real weight in the communities represented here.
The problem? Culture. So many people do not understand when I say the following things:
- Do not install via the GUI; use the deployment system so the unattended installation is documented.
- You should not have to log into a computer and manually install and customize things, especially when what you did never gets documented.
- You should not make any change, whether an install or a configuration tweak, without following that process AND updating the team in your notes, so that when I ask you about it, you in fact remember.
These practices result in fewer fires, as I see it, and are why I too have to re-image everything (although that is largely out of concern about malware, especially given the majority of our userbase).
Sometimes this is a bit annoying when you just want something to work but can't just hack it because the system files are hidden all over the place and papered over with symlinks, but in the long run it means your system configuration is way easier to maintain.
I have started to look into a career transition into DevOps (please do not laugh, feel free to downvote), not because of how cool it looks or the growing culture around it, but because I need a break from the mainstream approach: throw anything and everything at the wall until it sticks, without rigor; or "your time is cheap, so automation is a very, very low priority, since we pay you to do the tasks we dislike as more serious engineers" (who also build some systems with checklists).
So far, though, I have not managed to actually get to know it personally, so to speak.
Over the past two years we have tried to move as much of the configuration as possible to GPOs, although they bring their own share of problems. On Unix-like systems you can use log files to track down problems most of the time; on Windows, logging seems like kind of an afterthought. I especially hate it when gpresult says it applied a certain setting when, upon inspecting the system, the setting clearly has not been applied (or was overridden by something else? Who knows?).
It is frustrating because GPOs seem like such a great idea in theory.
GPO, and GPP (the preferences), are a nightmare.
- The gpresult utility and rsop.msc tell you what changed, but that does not mean much, because
- GPO processing is async and sometimes requires a full shutdown, not just a reboot (from 5+ years of experience), so good luck dropping this crap on a dime; God help you
- The Registry.pol files are not easily auditable or usable outside gpresult and rsop.msc
- If you hit slow links, processing will be disabled; the link is not slow all the time, but when 2% of the computers boot over it at 6:54 AM on a Tuesday, the policy does not get applied
- A whole bunch of other stuff I forget in this rant
But I totally agree with you, such a pain. I have not had a lot of time for Salt, Puppet, and Chef. Whether in parallel with or thanks to DSC (I saw some PowerShell in one repo ... Ansible?), those tools are also a reality.
I use an expensive SCCM alternative, but I am seriously considering proposing a move to one of the Chef/Puppet/Ansible/Salt stacks now that SSH on Windows is becoming a reality.
All I can say is thank God.
Obviously the latter is the best way, but it's interesting that the culture of the two systems is so different, no doubt borne from Windows' legacy era.
Most concerning for me, though, is what safety experts call "normalization of deviance". It's the process by which people become accustomed to small failures, which creates opportunities for big failures to happen. A big example is the Challenger disaster. 
I see shops with low bug rates, where people think a lot about quality. And I see shops that, thanks to high bug rates, are too busy fighting fires to ever spend much time on quality. I never see any place in between. And I think normalization of deviance is why.
When other incidents are logged, if you have defined the problem well enough, then you search for a problem record that matches the incident symptoms and link the incident to it. The problem record also holds the workaround that was used to get the end user up and running, so you can reuse this if the issue is super critical.
If you find that you've linked a certain number of incidents to the problem, then you know it's actually worthwhile doing root cause analysis and spending the time figuring out what is causing the error, so you go down the rabbit hole - and you can justify the time to do so.
When you figure out the root cause, if it's a simple resolution that doesn't require a major change to the environment, then you may not have to do much to prevent the issue in future - it sort of depends on the complexity/needs of your environment and organization. But regardless, you raise a known error record and link the problem to it. In the known error record you document the problem and as many symptoms as possible (some people list workarounds here, others list them in the problem record, others prefer to keep workaround info strictly in incident records), the root cause, and how you fully resolved the issue.
Regardless, if, mainly from the known error record, you find you need to make a scheduled change that may impact environments, then you lodge a change request through the CAB processes you have in place.
Normally I find that for server and network infrastructure, the change just requires coordination with the teams who use the infrastructure, which, if you've set up your CMDB properly, you can work out by backtracking from the infrastructure configuration items to the linked services. I've found that if you have defined your service catalog properly, then you will have defined your operational services and linked these to business services that are mostly customer facing. This helps with impact analysis and finding the correct window in which to make the change.
For things like fixing application bugs, I have found that it's still worthwhile raising a change request, then have that change moved into the development fix process with all appropriate testing, etc - normally this then links into a wider release management process which may actually require a new overarching change management request as other fixes are part of the change - sometimes you need to review the impact of how deploying the release might impact the environment in unexpected ways.
I think that's basically a big chunk of ITIL, and I've found that if it's done correctly, busy-work is reduced (busy-work mainly comes from asking for too much info), the service layer is set up appropriately, and the CMDB has been mapped well, then it actually can help medium to large organizations pretty effectively. The key is to define a catalog of services across the business; without this it's hard to know the impact of incidents, bugs, and any changes you may want to make in your environment.
It's been a while since I worked with a Windows network, 15-18 years or so, but graphs from back then were enough to prove that investing in multi-purpose, on-site network printers with free support was a good idea. Support logs dropped by about 20% if I remember correctly (lots of crappy inkjets went away), and it saved the company some money on printer repairs and on not buying ink cartridges and toners all over the place.
After that budget talks and time for in-depth problem solving became easier.
We ended up with something ITIL-like naturally. We just started scripting solutions and shared them with each other. Some of those scripts ended up being pushed to clients so traveling salespeople could remap network drives and do other simple things. Then we wrote a GUI for them, because clicking is easier, apparently. That didn't work properly, but it proved the case for remote control/monitoring/inventory software (well, control really, but it was IT buying the software, so...)
It probably did help a little that I wrote in C and my co-worker at the time thought x86 assembly was self-documenting.
Nowadays I develop on and use Linux for basically everything except gaming. The Friday horrors persist, though. This week it was trying to find a solution to a problem in someone else's code involving SQL triggers, framework triggers, various code components, and quite a few custom SQL tables/relations that I hadn't worked with before.
File attributes override the *nix permissions, such that you can set the immutable flag on a file and even root can't modify it until the flag is cleared. Try `chattr +i FILENAME && rm FILENAME`: the rm fails.
On distros that use file capabilities, copying a file doesn't copy the capabilities by default, e.g. copy the ping program and the copy won't work unless you're root.
When SELinux blocks an action, the error message is almost always misleading. E.g. a program tries to make a TCP connection that it doesn't have permission for: instead of an error like "SELinux violation" you get an error like "No route to host". To debug, you need to look at the SELinux audit.log and try to match up timestamps of violations with when your program died.
Also, it seems to only check if the * is near something else, not if it is before or after. Nor if the after is after a before (if that made any sense at all).
What made you think it's a Markdown implementation? It's pure text, with paragraphs delimited by empty lines, code blocks prefixed by a space (or two, I never remember), and emphasis marked by asterisks. There's
Or rather, I did until this week, when it suddenly stopped working.
When this happens to me, the first thing that I ask myself is "what changed?", and I'm usually able to track down the cause to some configuration change. Incidentally, this is also why I never like modifying a working system unless it's absolutely necessary.
The fact that it ultimately was caused by security features that would be very important on a multiuser shared server but are nearly irrelevant on the (presumably) single-user local machine he is using suggests that perhaps we shouldn't be thinking in a "one size fits all" paradigm for OSes, since a lot of problems like this one stem from the unnecessary extra complexity introduced by such thinking.
It's why I find auto-updating apps so infuriating. The trend is that every app, OS, and driver insists on being self-updating. It's going to be very difficult to maintain a reliable system if you're doing anything complex.
Privacy issues aside, that's another reason I never plan to use the continuously self-updating Windows 10.
There really isn't a good answer either way, but between "breaks occasionally" and "needs a full-time admin, but updates are vetted", I prefer option 1 for my private systems, and option 2 for things that run in production.
So you can't just say you want security fixes only, no new or changed features.
Going the other way requires developers to maintain a variety of old versions of their code so they can backport security changes. Which is a lot of work for them for very little extra value.
This also applies to Ubuntu, and probably Red Hat, though the latter's vastly smaller repos mean vastly greater reliance on third-party sources, and the concomitant risks of introducing/changing features when only security fixes are wanted, or riding bareback without security updates.
There's also the inherent conflict between running current code and fixed code. Debian's legendary conservatism reflects a bias toward the latter, at least on its stable branches. Of course, you're welcome to lead and bleed on testing, unstable, or experimental, if you so choose.
I develop browser-based software which, after one Chrome update, was rendered unusable by a bug in Chrome. Fortunately, Google pushed out a new version with a fix the next day.
This also seems like a good reminder that transparent updating only works if your team is good and responsive enough to fix mistakes on the fly. If you're rolling out one update a month, you'd better give people a choice so they can decline when you hand them faulty upgrades.
This is safe, but it paints you into a corner over time, where you become paralyzed and can't improve anything. What's needed is better testing, so that changes are safe.
At one place where I work, we're on nodejs 0.10, which is several release versions behind. It's causing us a bunch of problems, because while 0.10 is still technically not EOL'd yet, npm modules behave like it is... however we've left it so long, that the jump to current stable is a giant task, which we don't have the time for given other business reqs.
Tests indeed don't guarantee safety, but lots of small changes are easier to deal with than the occasional massive change. It's also the basic concept behind version control.
This story wasn't about trivial day-to-day developer bugs, but what kind of problems happen in really complex systems.
As someone who went through the process of a painful, long-delayed upgrade not too long ago, I definitely second this, although as a much more generalized principle. I think it'd be more accurate to say that there's a fine, eternal balancing act between "work" and "meta work", and that this principle applies to way more areas of life than systems work. However much fun (or "fun") it may be, as mjd said there, most/all of us primarily have work to do using our tools ("tools" in the most generic sense here, including knowledge) rather than work to do on our tools. To some extent, a few days spent on tools/skills is a few days not spent applying them, and it's all too easy to sink so much time going down various rabbit holes that "actual work" loses out.

But of course, on the flip side, improving our tools/skill sets is key to realizing major boosts in long-term productivity, keeping up with changing standards, and so on. I remember a few years back at one workplace when a number of senior engineers (50s/60s) all finally bit the bullet and started to work to get up to speed on the latest CAD developments. Or myself a decade back, when I decided I really needed to update my shell usage, read the full ZSH manual, and spent some time seeing how I could improve my speed in general. There were many significant projects going on, but then there always were, always something that "needs to be done next week!".

I personally find it can be a tough balancing act to weigh the savings gained from increased productivity down the road against the time expenditure needed to begin realizing them in the first place, particularly if "everything is working fine". I know that over the years I cumulatively lost plenty of time on manual involvement in tasks I could have automated, but each individual instance seemed trivial, and it was easy to default to just hacking something quick and getting on with the day versus deciding it'd be worth spending the time to improve it for good.
Of course that's all assuming there aren't any other barriers in the way. My extremely oddball pain point on one workstation was that I'd enthusiastically built a tower Mac Pro OS X system around ZEVO, a short-lived attempt to salvage Apple's old ZFS work and bring a fully functioning version to OS X. And despite a few niggles (some of which didn't matter to me, like being CLI-only), by the time it was getting ready to go it was fantastic, nicely integrated and all that. I was pumped; it was exactly what I'd wanted under OS X ever since I'd seen Sun's original presentation, and I hopped fully onboard.

But of course the company developing it promptly went under just as they were launching, were bought for IP/people by GreenBytes (which was itself subsequently acquired by Oracle), and after a single bugfix release that was it. It only worked under 10.8 and not one version later, and there was no clear upgrade path (I really didn't want to revert that system back to pure HFS). So 10.8 was where I stayed until OpenZFS, and in turn O3X, came along to save the day, but by that point I was out of the habit of frequent upgrades there. Testing is definitely helpful (along with a nice rollback system) but sadly can't always save you; frequent upgrades definitely help keep key meta-knowledge fresh.
This was a really cool bug track down article though, and inspiring.
"You can check your anatomy all you want, and even though there may be normal variation, when it comes right down to it, this far inside the head it all looks the same. No, no, no, don't tug on that. You never know what it might be attached to. " - Buckaroo Banzai
Also, I'll be an advocate for just starting emacs with systemd, and never worrying about it again.
EDIT: fixed incorrect description of the yak-shaving conclusion.
The dynamic loader did so because it ran with an extra capability that the user invoking it didn't already have. Most of that sanitizing exists to prevent the user from gaining those privileges themselves by invoking a more privileged program, such as by setting LD_LIBRARY_PATH. Sanitizing TMPDIR prevents a somewhat different class of vulnerabilities, such as using those extra privileges to write to files you normally couldn't. However, I don't think it makes sense to have a complex special case like "if you only have one of this subset of extra privileges, allow TMPDIR but don't allow all the other potentially dangerous environment variables"; that adds a significant amount of complexity and subtlety to already security-sensitive code.
Giving /usr/bin/perl itself extra capabilities effectively grants them to every user on the system, since you can use Perl to run arbitrary code. At that point, it would make more sense to just allow all non-root users to bind to arbitrary ports. I'm somewhat surprised that there isn't a sysctl to disable the reservation of ports 0-1023.
I think it'd make more sense to have a collection of lockdown functions for each capability, with the set of functions actually run being the union of the collections for each effective capability (full root just being the union of the collections for all capabilities).
Or, y'know, rethink root in general. Plan 9 had good ideas in this area …
The conclusion still holds, though: I don't think special-casing particular capabilities makes sense. And in the case of the dynamic linker, it doesn't actually have that information available; it relies on the AT_SECURE bit set in the process's "auxiliary vector" (see "man getauxval"), which the kernel sets when the process has any privilege its caller didn't have.
There is: /proc/sys/net/ipv4/ip_local_port_range
I checked for a sysctl controlling the ability to bind to privileged ports before writing my comment. The relevant code in the kernel compares against a hardcoded #define PROT_SOCK 1024, and doesn't have any means to disable that check. See inet_bind in net/ipv4/af_inet.c .
$ sudo sysctl net.inet.ip.portrange.reservedlow=10
net.inet.ip.portrange.reservedlow: 0 -> 10
$ nc -vvl 1
$ sudo sysctl net.inet.ip.portrange.reservedlow=0
net.inet.ip.portrange.reservedlow: 10 -> 0
$ nc -vvl 1
nc: Permission denied
No. You can theoretically use a writable TMPDIR to gain any privilege. So anything extra at all must be blocked.
Not when you access it as yourself rather than root. The capability in question only grants access to open low ports; there's no way to combine that with files in a temp directory to get root.
Much like companies should try to replace their own products (before a competitor does), infrastructure teams need to force “predictable” upgrades in a controlled environment on a regular basis. For example: look at your dependencies, imagine what upgrades are likely to be required in the near future, and try making those upgrades on test systems to see what could go wrong.
That approach achieves three things. One, since you're not in emergency mode and you've used a test environment, any problems that you do uncover are not going to cause a crisis. Two, if you do this semi-regularly, then you're likely to see only minor issues. Three, exploratory upgrades give you a lot of time to fix problems (whether it's time for your own developers to make changes, or time to wait for an external open-source project/vendor to make changes for you).
1) On Windows, VisualVM not being able to find the IntelliJ IDEA process I had running. This happened because IDEA was started with Launchy, which had a different TMP directory set, because I used both RDP and the Console, or something.
2) On Linux, ibus IMEs not working at all in my browsers; happened because I was starting them in a tmux, and the tmux server was started in my previous login session, so the DBUS_SESSION_BUS_ADDRESS in the tmux was stale.
If we want to avoid said complexity, we basically have to go back to running only one process at a time, loaded directly from dedicated, removable storage when needed.
What changed to make the perl become capable whereas previously it lacked the low port capability?
Also, a recent change was an admin setting TMPDIR; previously it had been left unset.
So it was a combination of two things that triggered it, and not one alone.
And this basically sums up all the troubleshooting time sinks I've experienced in my career.
I will confirm this with the sysadmins and add it to the article. Thanks!
If you actually are talking about "emacs" and not "emacsclient", then the argument you want is --eval '(setq server-socket-dir "/what/ever")'