
Ask HN: How do you document and keep tabs on your infrastructure as a sysadmin? - redsec
I am wondering how experienced sysadmins document and manage their infra.
======
cik
We use Collins
([https://tumblr.github.io/collins/](https://tumblr.github.io/collins/)) as a
Configuration Management Database, Ansible
([https://www.ansible.com/](https://www.ansible.com/)) for automation,
Terraform ([https://www.terraform.io/](https://www.terraform.io/)) + a bunch
of homebrew for orchestration, and Packer
([https://www.packer.io/](https://www.packer.io/)) for multi-cloud (and
hypervisor) image creation and maintenance, powered by Ansible. Every single
thing is committed to a series of Bitbucket
([https://www.bitbucket.org](https://www.bitbucket.org)) repositories.

We connect Ansible and Collins through ansible-cmdb
([https://github.com/fboender/ansible-cmdb](https://github.com/fboender/ansible-cmdb)),
then tie the entire thing to our ticketing systems, ServiceNow
([https://www.servicenow.com/](https://www.servicenow.com/)) and Jira Service
Desk ([https://www.atlassian.com/software/jira/service-desk](https://www.atlassian.com/software/jira/service-desk)),
and finally, ensure we have history tracking with Slack
([https://www.slack.com](https://www.slack.com)).
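For a sense of what the ansible-cmdb glue looks like in practice, here's a rough sketch (not our actual code; the paths and fact names are illustrative) that reads facts gathered with `ansible all -m setup --tree ./facts/` and prints a tiny CMDB-style overview:

```python
#!/usr/bin/env python3
"""Rough sketch: build a tiny CMDB-style overview from Ansible facts.

Assumes facts were gathered with something like:
    ansible all -m setup --tree ./facts/
The field names below match common Ansible fact keys, but treat them as
illustrative rather than guaranteed on every platform."""
import json
import pathlib

FACTS_DIR = pathlib.Path("./facts")  # hypothetical output of --tree

rows = []
for host_file in sorted(FACTS_DIR.iterdir()):
    data = json.loads(host_file.read_text())
    facts = data.get("ansible_facts", {})
    rows.append((
        host_file.name,
        facts.get("ansible_default_ipv4", {}).get("address", "n/a"),
        f'{facts.get("ansible_distribution", "?")} '
        f'{facts.get("ansible_distribution_version", "")}'.strip(),
        facts.get("ansible_memtotal_mb", "?"),
    ))

# Print a minimal inventory table; ansible-cmdb does this (and much more)
# as searchable HTML.
print(f'{"host":<25} {"ip":<16} {"os":<20} mem_mb')
for host, ip, os_name, mem in rows:
    print(f"{host:<25} {ip:<16} {os_name:<20} {mem}")
```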

As a given, we yank test the entire world. If it doesn't pass a yank, it
straight up doesn't exist.

Whether it's bare-metal, virtualized, para-virtualized, dockerized,
mixed-mode, or cloud - we 100% do this all the time. There is not a single
change across any environment that isn't fully tracked, fully reproducible,
fully auditable, and fully automated.

~~~
woodrowbarlow
what do you mean by "passing a yank test"? i assume "yank test" refers to
unplugging the network cable abruptly from the server under test, but what
exactly are you looking for when you do that?

~~~
cik
A yank test on process and infrastructure is more than a 'did it come up'.
It's a "if we totally nuke the thing" \- say, were we to rip the hard drives
out of a server, fry it, and recreate it - does it come up identical(ish)?

That way we know our CMDB is accurate, our workflows are accurate, and so are
credentials, Ansible, Terraform, images, etc. Right down to tickets.
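To make the idea concrete, a minimal (hypothetical, not our actual tooling) version of the check is just diffing fact snapshots from before and after the rebuild, ignoring fields that legitimately change:

```python
#!/usr/bin/env python3
"""Illustrative 'yank test' check: diff a host's fact snapshot taken before
destruction against the one gathered after the automated rebuild, ignoring
fields expected to churn. The VOLATILE list is an assumption."""
import json
import sys

# Keys that legitimately differ across a rebuild (example list).
VOLATILE = {"ansible_uptime_seconds", "ansible_date_time",
            "ansible_product_serial", "ansible_machine_id"}

def load(path):
    with open(path) as fh:
        return json.load(fh).get("ansible_facts", {})

before, after = load(sys.argv[1]), load(sys.argv[2])
diffs = [k for k in (before.keys() | after.keys())
         if k not in VOLATILE and before.get(k) != after.get(k)]

if diffs:
    print("rebuild drifted on:", ", ".join(sorted(diffs)))
    sys.exit(1)
print("yank test passed: rebuild matches the original")
```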

It's how we manage all of our cloud customers.

------
vinceguidry
When I was working as a sysadmin, I kept a spreadsheet. I was told later of a
repository of information that supposedly did what my spreadsheet did, but it
didn't add anything new and was much harder to keep up to date.

I built it up using nmap and then shelling into each individual machine and
poking around to see what it did. This was back in the days before everything
became virtualized, so each machine on the network was likely physical.

I added information by walking the aisles and copying down the rack location
of every machine into another page on the spreadsheet. I eventually hooked up
a terminal to them all and matched network addresses to physical machines.

Only took a few weeks and when I was done, I knew things about the network
that guys who worked at the business for years didn't know.

There's no substitute for the good old-fashioned way.

I liked that job, it was fun.

~~~
majewsky
If you're by yourself, using spreadsheets and nmap is usually fine. If you're
working in a team of 5 or 10 or 50 sysadmins, spreadsheets turn into a huge
mess. Either you distribute them via mail etc. after every change, but then
you will have concurrent edits that need to be merged manually; or you put the
spreadsheets on a network share with file locking, but then it will always be
locked when you want to edit it because someone is working on an entirely
unrelated part of the infrastructure.

So you have exactly those sorts of problems that an RDBMS is designed to
solve. Therefore it makes sense to move to a DCIM system using an RDBMS under
the hood, one that allows for concurrent edits and can also be accessed by
automation (cronjobs, CI, etc.) via some sort of API (or direct DB read
access).
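As a sketch of what "accessed by automation via some sort of API" can look like, assuming NetBox as the DCIM and the pynetbox client (URL and token are placeholders):

```python
#!/usr/bin/env python3
"""Minimal sketch of 'automation reads the DCIM': list active devices from a
NetBox instance so cronjobs/CI consume the same source of truth humans edit.
Assumes the pynetbox client and a NetBox at the URL below (hypothetical)."""
import pynetbox

nb = pynetbox.api("https://netbox.example.com", token="REDACTED")

for device in nb.dcim.devices.filter(status="active"):
    ip = device.primary_ip.address if device.primary_ip else "no primary IP"
    print(f"{device.name:<30} {ip:<20} {device.site}")
```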

~~~
sedachv
There is an even better alternative. You can put infrastructure information
into the same version control repository where your infrastructure code lives,
and you can even keep all the benefits of spreadsheets by using plain text
format spreadsheets like Org-mode tables.

This means you do not have two sources of truth to maintain (what is in the
RDBMS, and how that relates to what is in the infrastructure code repository),
the RDBMS system does not have to reinvent versioning, you can see exactly how
your infrastructure evolves, you can do atomic changes to both the
infrastructure code and the infrastructure information that the code relies on
(obviously you need a modern version control system for this), and the
infrastructure code can access the infrastructure information in a much more
straightforward (and much easier to test) way.
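For example, here is a short (purely illustrative) parser that turns an Org-mode table in the repo into records the infrastructure code can consume; the file name and columns are made up:

```python
#!/usr/bin/env python3
"""Sketch of using a plain-text Org-mode table as machine-readable inventory
data, so infra code and humans share one file in the same repo."""

def parse_org_table(path):
    rows = []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            # Skip non-table lines and |----+----| separator rows.
            if not line.startswith("|") or set(line) <= {"|", "-", "+", " "}:
                continue
            rows.append([cell.strip() for cell in line.strip("|").split("|")])
    header, *body = rows
    return [dict(zip(header, r)) for r in body]

# Example hosts.org contents:
# | hostname | ip        | role |
# |----------+-----------+------|
# | web01    | 10.0.0.11 | web  |
# | db01     | 10.0.0.21 | db   |
for host in parse_org_table("hosts.org"):
    print(host["hostname"], host["ip"], host["role"])
```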

~~~
dmannorreys
This would become very exhausting if working with very large infrastructures.
80 000 virtual and physical servers? Have fun keeping that data consistent, up
to date and available with Org-mode and version control.

I'm not saying your example is wrong, but "there is an even better
alternative" doesn't always apply. For smaller scales, sure.

~~~
sedachv
VMs need to be kept track of in whatever system you use for provisioning (AWS,
OpenStack), otherwise you now have three sources of truth: what the
configuration says should be running, what the DCIM thinks is running, and
what is actually running.

------
majewsky
\- keep inventory in a DCIM (we use Netbox)

\- configure _everything_ as code (we use Ansible for the infrastructure up to
OS level, Kubernetes w/ Helm for applications), have it read the values from
the DCIM so that the DCIM remains the single source of truth (we still need to
get better on this part; a rough sketch of the inventory-reading side is at
the end of this comment)

Links:
[https://github.com/digitalocean/netbox](https://github.com/digitalocean/netbox)
[https://www.ansible.com](https://www.ansible.com)
[https://www.kubernetes.io](https://www.kubernetes.io)

That's at work. At home, I do much of the same, except that maintaining a DCIM
is excessive for 2 VPS and a home network of 3 boxes.
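Here's the kind of inventory-reading glue I mean, sketched as a tiny Ansible dynamic inventory script backed by NetBox (URL, token and field names are assumptions; in practice the official netbox.netbox.nb_inventory plugin does this properly):

```python
#!/usr/bin/env python3
"""Sketch of an Ansible dynamic inventory backed by the DCIM, so playbooks
read hosts straight from NetBox instead of a hand-edited file."""
import json
import sys

import requests

NETBOX = "https://netbox.example.com"     # hypothetical
HEADERS = {"Authorization": "Token REDACTED"}

def list_inventory():
    devices = requests.get(
        f"{NETBOX}/api/dcim/devices/?status=active&limit=0",
        headers=HEADERS, timeout=10,
    ).json()["results"]

    inventory = {"_meta": {"hostvars": {}}}
    for dev in devices:
        # Group hosts by their role slug; field names vary by NetBox version.
        group = (dev.get("device_role") or {}).get("slug", "ungrouped")
        inventory.setdefault(group, {"hosts": []})["hosts"].append(dev["name"])
        if dev.get("primary_ip"):
            addr = dev["primary_ip"]["address"].split("/")[0]
            inventory["_meta"]["hostvars"][dev["name"]] = {"ansible_host": addr}
    return inventory

if __name__ == "__main__":
    if "--list" in sys.argv:
        print(json.dumps(list_inventory()))
    else:
        print(json.dumps({}))
```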

~~~
rubenbe
I cannot comment on the DCIM side, but I agree on the "everything as code"
mantra.

For a relatively small setup I chose a combination of Ansible, Kubernetes and
Dockerfiles, but probably any combination will do. All these files are stored
in a git repo.

Even after months (or years) of neglect, I can easily know what I configured
(and why!) and update where needed with minor effort.

------
zie
I'm going to mostly disagree with everyone here, much to my karma's detriment
;P

I agree the end-goal should be infrastructure as code, and everyone here has
covered those tools well. You also want monitoring across your infrastructure.
Prometheus is the new poster-boy here, but the Nagios family, and many other
decent OSS solutions exist as well.

But you still need documentation. Your documentation should exist wherever you
spend most of your time. Some examples:

* If you spend most of your time on a Windows desktop, doing Windows admin type things, then OneNote or some other GUI note-taking/document program makes sense.

* If you spend most of your time in Unix land (Linux, BSD, etc.), then plain text files on some shared disk somewhere, for everyone to get to, make WAY more sense. Bonus if you put these files in a VCS and treat them like code, and super bonus if your documentation is just a part of your infra-as-code repositories.

* If you spend your time in a web browser, then use a wiki, like MediaWiki, wikiwiki, etc.

In other words, put your documentation tools right alongside your normal
workflow, so you have a decent chance of actually using it, keeping it up to
date, and having others on your team(s) also use it.

We put our docs in the repos right alongside the code that manages the
infrastructure, in plain text. It's versioned. We don't publish it anywhere;
it's just in the repo, but then we spend most of our time in editors messing
around in that repo.

~~~
antoncohen
I totally agree, but having "infrastructure as code" means less documentation.

Instead of documenting all the commands involved in configuring a machine as
service X (ssh, run apt-get, paste this, etc.), I have documentation on how
to work with the configuration management system (roles in the roles/ directory,
each node gets one role, commit to git, open PR, etc.). That documentation is
in .md files in the config management source repo.

Instead of documenting how to rack a server (print and attach label to front
and back, plug power into separate PDUs, enter PDU ports into management
database, etc.), I document Terraform conventions (use module foo, name it
xxx-yyy, tag with zzz, etc.).

It ends up being less documentation, as the "code" serves to document the
steps taken, so the documentation can be higher level. Or if it isn't less
documentation, it is documentation that needs to be updated less often, so
hopefully there will be less drift between docs and what actually exists.

~~~
hobofan
Ah the good old "self-explanatory code that needs no documentation".

~~~
marcosdumay
More like: code usually required extra documentation explaining it in a
higher-level language, but nowadays we just write the program in that
higher-level language, so this extra documentation has gone away.

------
antoncohen
It might be helpful if you described your infrastructure. There is a pretty big
difference between managing physical Windows servers in a data center and
managing Linux servers all in AWS.

If you are all or mostly cloud, Terraform + config management with a CI
pipeline takes care of a lot. Then a wiki that covers "Getting Started" and a
few how-to articles.

For physical infra you need the setup for DHCP, updating DNS based on DHCP,
PXE boot imaging, IPMI access and configuration, switch and router
configuration, what servers are connected to which switch ports, PDU
management and monitoring, and on and on and on.

You end up with something like NetBox
([https://github.com/digitalocean/netbox](https://github.com/digitalocean/netbox))
or Collins
([https://tumblr.github.io/collins/](https://tumblr.github.io/collins/)), plus
a bunch of other stuff gluing things together.

~~~
evangineer
For future work, I would definitely consider NetBox and Collins as alternative
options to GLPI.

------
beh9540
I think it depends a lot on the size of your infrastructure. I've used Excel
docs on a shared drive pretty successfully where there's not much to keep up
on and changes are few.

In larger infrastructure setups (small service provider) we used a combination
of netboot, SNMP for monitoring with Observium and Nagios for alerting. We
were also a big VMware environment, so naturally we had a lot of inventory
tracking available through vCenter as well. I found a lot of opposition to
Configuration Management, given the lack of comfort with programming of some
sysadmins (Windows admins), so that's something to keep in mind as well. I
think mixed environments also can be challenging w/infrastructure as code, but
I'd be interested to see how others get through that.

------
seorphates
The past decade has been interesting and I'm still processing it.

My current thoughts are that an appropriate approach is for your systems to
document themselves via the applications that they run - inside out.

Though I must abide, I cannot fully subscribe to "infrastructure as code"
anymore. It has proven to be just another shift, primarily in toolsets and in
who (or what) gets say and sway over the capacity, capabilities and
efficiencies of the thing you actually care about \- the app stack and all of
its assembled functionality.

In other words, most approaches are still "outside in" \- one defines 'x' for
deploy fitments, typically over and over and over again, and typically with a
rigidity that can too easily override and overrule, effectively caging your
application in scale and scope. With my current tack I am trying to provide
for 'y' to "self identify" (via some/any form of config mgmt), from which
point you can begin to effectively "deploy to any" by hooking the "application
config as code" that, in turn, defines its infrastructure and deploys
"outward". The "infrastructure as code" then becomes the servant with its
objects and platform definitions etc., and the "appconfig as code" becomes the
master, where the latter defines its own scope and scale.

Infrastructures have a funny way of mutating into inefficient "definitions" of
something that once made sense, on the first day, and forevermore complicating
progress with capacity, rules and opinions.

But, generically, SNMP is still pretty cool for telling me what I need to
know. Strap that into any engine and, boom, you can ask any question, request
any inventory.

So.. I track apps, not systems. Systems are expendable, applications are not.

------
brudgers
I don't do devOps but if I did...

[http://howardism.org/Technical/Emacs/literate-devops.html](http://howardism.org/Technical/Emacs/literate-devops.html)

[https://www.youtube.com/watch?v=dljNabciEGg](https://www.youtube.com/watch?v=dljNabciEGg)

------
itomato
There are several classes of "infrastructure" as a sysadmin: legacy, new, and
critical.

Legacy stuff is done the old-fashioned way - portscans and nmap. If it has an
open port, it's presumed to be intentional. If not, it's a target. I've seen
some success using tools like Pysa to "blueprint" existing systems into Puppet
code. Tools like SystemImager help here, too - enabling P2V and the creation
of "file-based images" compatible with version control and able to PXE boot
new clones.

New stuff is from-scratch IaC all the way to the metal. Ansible and git
submodules help me build "sandwiches".

Critical stuff blurs the lines. The machines, IP addresses, ports and living
connectivity can be documented, and "captured" to a limited extent with the
manual mapping and Rsync stuff in the Legacy category. Some of this critical
stuff is also "new", and is deployed in that fashion.

What about switchgear and Cisco configs? License strings, key management,
site-specific patching - all can complicate things.

More important than any of these is the ability for you and those around you
to see and manage the systems as they are launched and terminated.

In the old days, I used to use a shell script on a newly-provisioned host to
dump all its details - dmidecode, environment stuff and so on. Those details
were pushed back to a common source and were a real benefit in the days before
_real_ config management came on the scene. CFEngine was way too complicated
and nebulous at the time.
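A rough modern-day equivalent of that kind of script, sketched in Python (the commands and destination path are illustrative; dmidecode needs root):

```python
#!/usr/bin/env python3
"""Rough equivalent of the 'dump the box's details at provision time' script
described above; details are pushed back to a common location."""
import json
import socket
import subprocess
import time

def run(cmd):
    try:
        return subprocess.run(cmd, capture_output=True, text=True,
                              check=True).stdout
    except (OSError, subprocess.CalledProcessError) as exc:
        return f"failed: {exc}"

report = {
    "host": socket.getfqdn(),
    "collected_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    "uname": run(["uname", "-a"]),
    "dmidecode_system": run(["dmidecode", "-t", "system"]),
    "ip_addr": run(["ip", "-brief", "addr"]),
}

# "Pushed back to a common source" - here just a shared filesystem path
# (placeholder; could as easily be a git repo or an HTTP endpoint).
with open(f"/srv/inventory/{report['host']}.json", "w") as fh:
    json.dump(report, fh, indent=2)
```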

------
falcolas
For me/us, it's a combination of infrastructure-as-code and metrics
reporting/logs. Most of our boxes are swapped out on a weekly or more frequent
basis, so the only accurate picture of what's running right this moment is the
graphs built by the metrics collection tools. The only accurate picture of
what's running on those boxes is the code which built the infrastructure.

There are a couple of exceptions, but those are actively being brought under
the above model (mostly because they are effectively invisible, and the
existing documentation for them is... incomplete).

Any documentation outside of that is stale in a few hours, and obsolete in a
week.

------
jcadam
Back when I was put in charge of IT Lifecycle management for my Army unit (not
by choice - "Hey, you've got a CS degree, so anything tech related goes to
you"), I kept it all in an Access Database, and ran off a report occasionally
to update my smartbook (3-ring binder full of stuff that my boss would
frequently ask about during meetings). Granted this was back in the early
00's.

------
owaislone
Terraform + Datadog + Cloudwatch

[http://terraform.io/](http://terraform.io/)
[http://datadoghq.com/](http://datadoghq.com/)
[https://aws.amazon.com/cloudwatch/](https://aws.amazon.com/cloudwatch/)

------
atsaloli
As a professional sysadmin, my go to reference on this is "Documentation
Writing for System Administrators", from the Short Topics in System
Administration series.

[https://www.usenix.org/short-topics/documentation-writing-system-administrators](https://www.usenix.org/short-topics/documentation-writing-system-administrators)

Also, this talk was very good:

[https://www.usenix.org/legacy/event/lisa08/tech/gelb_talk.pdf](https://www.usenix.org/legacy/event/lisa08/tech/gelb_talk.pdf)

~~~
jlgaddis
It's worth the $5, I assume?

------
allsunny
I've used [https://www.racktables.org](https://www.racktables.org) with pretty
good luck. It's PHP, which wouldn't be my first choice, but I've largely been
able to make it do what I want.

If you want something more clever, say keeping track of asset values etc.,
you'll want a CMDB. Google around and you should find something that fits your
needs. We used ServiceNow in a previous life.

------
paydro
We put everything in code. We have several layers, but if you're new you can
start with the lowest level and make your way up to find out how things are
provisioned and configured.

We're on AWS so we use cloudformation for provisioning and saltstack
([https://saltstack.com/](https://saltstack.com/)) for configuration
management. Cloudformation templates are written using stacker
([http://stacker.readthedocs.io/en/stable/](http://stacker.readthedocs.io/en/stable/)).
All AWS resources are built by running "stacker build" so nothing is done by
hand. We have legacy resources that we're slowly moving over to
Cloudformation, but more than 90% of our infrastructure is in code.

On top of cloudformation and salt we built jenkins (CI and docker image
creations), spinnaker (deployment pipeline), and kubernetes (deployment
target). The jenkins and spinnaker pipelines are also codified in their own
respective git repos.

All the repos here have Sphinx set up for documentation purposes and the repos
tend to cross-link for references.

------
rbjorklin
I’ve found Zabbix works decently well and also covers monitoring. Zabbix Maps
can be nice to visualize the infrastructure:
[https://www.zabbix.com/documentation/3.4/manual/config/visualisation/maps/map](https://www.zabbix.com/documentation/3.4/manual/config/visualisation/maps/map)

------
bradknowles
So, one problem I’ve seen with most infrastructure-as-code solutions and CMDBs
is that while they do a good job at the tactical level (more or less), and
help you answer “how”, “where”, “what”, and maybe “when” questions (depending
on how well they support orchestration), they typically do a bad job at the
higher-level strategic “why” questions.

So, why do you structure your lambda jobs accessing CloudWatch Logs that way
as opposed to the other way? If you didn’t know that one way works and the
other doesn’t, you wouldn’t be able to understand that question. And that
might have domino effects on other parts of your system.

I haven’t found a good solution to documenting the high level strategic “why”
questions, other than to just write down the questions and the answer, with
reasoning, in some form of associated documentation — maybe in a wiki or
something. But, of course, the underlying issues may change in the near future
and invalidate the reasons for your decision. And the high level documentation
doesn’t have any way to be compiled directly into the lower level
implementation, so of course there is always the risk of drift.

I’m still looking for good solutions in this space.

------
tyingq
VMware's tagging support is a lighter, more realistic option vs. a full "CMDB".

Come up with a key/value strategy that covers your need to track things like
app name, app category, environment (test, dev, load testing, prod, prod/dmz,
etc), and it becomes actually usable and up to date versus an always out-of-
date CMDB. And it's compatible with cloud resource tagging.

Sometimes, less is more.
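As an illustration of what enforcing such a strategy can look like (the required keys and environment values are just examples, not a recommendation):

```python
#!/usr/bin/env python3
"""Tiny sketch of enforcing a key/value tagging strategy: validate that every
asset (VMware VM or cloud resource) carries the agreed tags."""

REQUIRED = {"app_name", "app_category", "environment"}
ENVIRONMENTS = {"test", "dev", "loadtest", "prod", "prod-dmz"}

def tag_problems(name, tags):
    problems = [f"missing tag '{k}'" for k in sorted(REQUIRED - tags.keys())]
    env = tags.get("environment")
    if env is not None and env not in ENVIRONMENTS:
        problems.append(f"unknown environment '{env}'")
    return [f"{name}: {p}" for p in problems]

# Example run against a couple of hypothetical inventory entries.
inventory = {
    "web01": {"app_name": "shop", "app_category": "web", "environment": "prod"},
    "db17":  {"app_name": "shop", "environment": "production"},
}
for asset, tags in inventory.items():
    for problem in tag_problems(asset, tags):
        print(problem)
```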

------
outworlder
Spinning up new infra: Jenkins crafts Terraform tfvars based on user input,
runs plan, asks for confirmation, applies. Terraform state and vars saved to
S3. Chef and Ansible for provisioning.

"Documentation", in terms of where stuff is deployed and what is deployed is
not really necessary. We save this data to a DynamoDB table, query-able by AWS
Lambda functions, so other automation can pick it up and devops can query
data.
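For the curious, the shape of that setup is roughly this (the table name, key schema and fields here are placeholders, not our actual schema):

```python
#!/usr/bin/env python3
"""Sketch of the 'deployment metadata in DynamoDB, queried by Lambda' idea."""
import boto3
from boto3.dynamodb.conditions import Key

TABLE = boto3.resource("dynamodb").Table("deployments")  # placeholder table

def handler(event, context):
    """Lambda entry point: return where a given service is deployed."""
    service = event["service"]
    resp = TABLE.query(KeyConditionExpression=Key("service").eq(service))
    return {
        "service": service,
        "deployments": [
            {"environment": item["environment"], "cluster": item.get("cluster")}
            for item in resp["Items"]
        ],
    }
```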

Documentation on how things work comes from dev teams; documentation on how
things are deployed indeed comes from us, just simple wiki pages.

Services run in Kubernetes, with K8s worker instances in auto-scaling groups.
If one node dies it is killed and brought back up, and K8s will reschedule the
pods. Same for the pods themselves.

Monitoring is through Nagios (finally getting phased out), New Relic and
Prometheus. Basic ELK stack for centralized logs.

Thinking about rolling out Vault for credential management. ChatOps is in the
pipeline (getting pieces in place first, like the DB mentioned earlier).

I'm trying to get the company on board on immutable infrastructure, but it is
proving difficult.

------
rootsudo
I use OneNote.

But I also use the O365 suite.

MediaWiki is also good, but it can be a bore to run another service just for
that.

But in the end a text file via notepad/nano is all you need, really.

~~~
jftuga
OneNote over a text editor, as you can drop in screenshots.

------
FatalBaboon
Like many here, I keep it described in Ansible and documentation inside a git
repository.

But I feel like it's lacking. After a while you have so many Ansible playbooks
and roles that they cannot give you a bird's-eye view anymore.

I think I would MUCH prefer to have some sort of HTML representation, where
adding an instance/service starts by adding to that representation, and you
could click on every link or node to show its golden image setup, ansible
configuration, etc.

THAT, I could show to a newcomer and he'd get it.

~~~
petepete
I'm no expert but doesn't Ansible Tower do that?

~~~
FatalBaboon
Ansible Tower lets you execute a playbook via a web GUI, and keeps a log of
who executed what.

I'm not sure if it also shows some infrastructure graphs, but I'm talking
about knowing if links are up, how they are firewalled, where the config for
each thing is, etc.

When you host tens of services on hundreds of machines, this information is
hard to get a grasp on, no matter what you do or how well you documented
everything, because it takes a while to read through it.

------
richardknop
By having your infrastructure defined in version control using some sort of
domain specific language. For example, by using Terraform and only ever making
changes to your infrastructure via Terraform (manual adding/editing of stuff
in AWS/GCP console should be disabled so people can't do that). Then all
changes to the infrastructure are clearly documented in version control with
pull requests.
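A small companion sketch: a scheduled job that catches out-of-band console changes by leaning on Terraform's documented -detailed-exitcode behaviour (0 = clean, 2 = pending changes); the alerting hook is hypothetical:

```python
#!/usr/bin/env python3
"""Scheduled drift check: flag manual console changes by checking whether
`terraform plan` still matches reality."""
import subprocess
import sys

result = subprocess.run(
    ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
    capture_output=True, text=True,
)

if result.returncode == 0:
    print("no drift: infrastructure matches the code")
elif result.returncode == 2:
    print("drift detected - someone changed something outside Terraform?")
    print(result.stdout)
    # notify_team(result.stdout)  # hypothetical alert hook
    sys.exit(1)
else:
    print("terraform plan failed:", result.stderr)
    sys.exit(result.returncode)
```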

------
tyingq
Aligning a VMware tagging strategy with a cloud tagging strategy is one of my
current goals. Things like a full-blown CMDB seem to always end in pain, lag,
and orphaned records. I'm happy enough with something basic that spans on-prem
+ cloud.

------
tmikaeld
I use:

\- [https://www.bookstackapp.com/](https://www.bookstackapp.com/) for portable
(Markdown), searchable (SQL), manageable (Users) documentation.

\- Ansible for automation and deployment.

\- Prometheus for monitoring all the Proxmox nodes and containers.

------
tootie
Can I piggyback and ask how people keep track of deployed software? Like, if I
have 50 products deployed, some of which haven't been touched in 10 years, and
I want to be able to ramp up a developer to fix a bug on any of them?

~~~
jschwartzi
In the medical device industry you keep what's called a device history file
which tracks the configuration of each device you've sold by serial number.
This DHF is meticulously updated whenever something is changed. If someone
reports an issue this is information you can use to scope your initial
reproduction.

~~~
forgottenpass
I think you're mixing up DHF and DHR. Design history file, device history
record.

------
evangineer
GLPI with FusionInventory for IT Asset Management and Knowledge Base.

GitLab for repositories, adhoc documentation via gists and CI/CD.

Nagios for monitoring.

Open to trying other things out if they make sense.

------
peterwwillis
Asset management systems and network inventory databases.

------
skyisblue
Those using AWS ALB, how do you monitor your traffic in realtime? I want to
aggregate host names, ip addresses, user agents in realtime.

------
thrownaway954
Lansweeper ([https://www.lansweeper.com/](https://www.lansweeper.com/))

------
cat199
Anyone have any pointers for simple, API-driven management of DNS/DHCP?

(like, I don't want to have to configure 1000 moving parts)

Typically this seems to fall into the 'roll your own' or 'giant lumbering
enterprise behemoth' category that does 10 other things. I'm looking for the
sweet spot.

~~~
matt_wulfeck
At any reasonable scale you typically wouldn’t use plain DNS if you have to do
that kind of configuration. It would be done with a service discovery service
which handles SRV records.

That being said, Route 53 has a reasonable management API.
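For illustration, upserting an A record through that API with boto3 looks roughly like this (zone ID and names are placeholders):

```python
#!/usr/bin/env python3
"""Example of the Route 53 management API: upsert an A record with boto3."""
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z0000000EXAMPLE",          # placeholder zone ID
    ChangeBatch={
        "Comment": "managed by automation",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.example.com.",
                "Type": "A",
                "TTL": 300,
                "ResourceRecords": [{"Value": "203.0.113.10"}],
            },
        }],
    },
)
```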

~~~
cat199
thanks - should have mentioned specifically not looking at cloud services

(e.g. self-hosted, but without needing 5 different polyglot microservices and
a service management layer and 32GB of RAM just to keep the whole mess
running)

I see 0% need for this complexity in many cases on the presentation side - and
if faster response is required internally, the same API interface can be used
for service discovery, or side-chain announcements etc. can be bolted on on a
per-application basis if desired.

I also see 0% need for this to be a cloud exclusive domain - e.g. hybrid
scope/location deployments, etc.

------
HeadlessChild
A configuration manager, Ansible for example. You basically describe your
infrastructure with it.

------
nunez
I deploy it with code. For hardware stuff, a CMDB, also maintained with code.

------
hypnagogicjerk
What about securely storing credentials and passwords?

------
dxhdr
I'm curious how ChatOps practitioners handle this.

------
AdamGibbins
I'm not sure I understand your question fully? You write documentation, like
you do for anything else. And you configure everything with code, so you can
go read it (Terraform, Chef/Puppet/Ansible, etc.).

~~~
castis
OP is probably looking for someone to go a little further into detail on
exactly what you just said.

~~~
pnutjam
I use SCC (System Configuration Collector) to document our servers. Everything
else is just a collection of grep-able text files on our management server.
[https://sourceforge.net/projects/sysconfcollect/](https://sourceforge.net/projects/sysconfcollect/)

