It's also worth noting that some widely-used pieces of software cache DNS lookups in-process (MongoDB), so changing a DNS name is no guarantee that all connected processes will automatically fail over to the new machine unless restarted.
Moral of the story: distributed systems is hard
I know, I know, I hate it too. We're working on it. Just wanted to share the workaround.
Speculating here: I suspect that an important Sun or IBM customer had an app that did lots of DNS lookups and the performance stunk. So an engineer did a quick 'fix' to cache DNS lookups. The customer was happy, everyone moved on. Some time later this quick fix got ported into the mainline code base. But it appears that nobody did a proper analysis of the quick fix, i.e., made it respect the TTL on DNS records. Maybe supporting TTLs wasn't important because this was back in the early days of Java, when it was trying to win the desktop war and desktop apps weren't really expected to be long-lived processes.
I understand Java doing 'its own thing', because the goal is to provide consistent behaviour on all platforms, but it shouldn't be stupid behaviour.
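In fairness, the JVM's cache lifetime has long been tunable; a minimal sketch (60 seconds is an arbitrary example, and the second form is the legacy Sun system property):

    # In $JAVA_HOME/lib/security/java.security:
    #   networkaddress.cache.ttl=60
    # Or at JVM startup:
    java -Dsun.net.inetaddr.ttl=60 -jar app.jar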
Otherwise you won't be able to get that classifying information on the end host.
We actually studied this for a while and MAC address is a pretty universal ID that works virtually anywhere. We use the smallest physical non-zero MAC address in case of multiple NICs. We considered using chassis or baseboard serial numbers but it gets too vendor specific.
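A minimal sketch of that selection on Linux (ignoring the physical-vs-virtual NIC filtering, which you'd need in practice):

    # List every interface's MAC, drop the all-zero ones,
    # and take the lexicographically smallest:
    ip -o link show \
      | sed -n 's/.*link\/ether \([0-9a-f:]*\).*/\1/p' \
      | grep -v '^00:00:00:00:00:00$' \
      | sort | head -n 1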
If your developer or sysadmin assumes that the server named castle will always be something special, instead of looking in the CMDB or other ENC for that information, you will have less fun in the long run.
tl;dr: random names with a central ENC force you to get meaning or facts from one central repo only.
With logical mappings, you just use chef/puppet/cmdb to do something like sshnode prod app 1 to connect to the first production app node or sshnode prod app to connect to all the production app nodes. CNAMES can do this too but then you run into potential DNS consistency issues. With this you get the human friendly abstraction and keep the machine friendly determinism.
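For the curious, a hypothetical sshnode could be a thin wrapper over knife search; the query attributes and output parsing here are assumptions:

    #!/bin/sh
    # sshnode <env> <role> [n] - ssh to the nth matching node, or list them all
    env="$1"; role="$2"; n="$3"
    hosts=$(knife search node "chef_environment:$env AND role:$role" -a fqdn \
              2>/dev/null | awk '/fqdn:/ {print $2}' | sort)
    if [ -n "$n" ]; then
      ssh "$(printf '%s\n' "$hosts" | sed -n "${n}p")"
    else
      printf '%s\n' "$hosts"    # fan these out with pssh/cssh as needed
    fi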
Of course when your chef server goes down this can be a PITA (I recommend building in some knife search caching!)
Ours are named for location and function. There are some large servers which have six-character names, as they are so unique/important to the organization that there is never a question. Usually, though, many fall into <owner+location+environment>.
However, I can see justifications made in either direction; it's no worse than the coding conventions people come up with.
If you start building a small cluster which you know is going to grow into a big cluster, then these aren't really premature optimizations so much as laying a proper foundation. It's the difference between rewriting some of your Python in C and making sure all your Python code is packaged into sensible objects and grouped into libraries.
It's one of the two hard problems, you know. Help is always appreciated.
He noted the off-by-one errors as a concern.
I was thinking: this is untested, unreviewed, 2-minute code that shows I can at least do some sort of problem solving. I'm nervous, don't know you, kinda worried about the tone of this interview and you're worried about off-by-one errors.
Kinda glad I didn't get that one. Wish I'd just hung up on the guy.
The problem with getting praised a lot for being smart is that you then feel obliged to say smart-sounding things often, and being critical is an easy way to sound smart. It's even worse if you're hired to be smart. So the guy, if he wasn't a total tool, was probably thinking: I wish I could just be coding, but instead I'm supposed to interview a bunch of people in a way that guarantees it will be awkward. And shit, I just got nervous and said something that sounds insulting. I wish I could just hang up.
From the linked Wikipedia article:
The approved words can only be used according to their specified meaning. For
example, the word "close" can only be used in one of two meanings:
- To move together, or to move to a position that stops or prevents materials
from going in or out
- To operate a circuit breaker to make an electrical circuit
The verb can be used to close a door or close a circuit, but cannot be used in
other senses (for example, to close a meeting or to close a business).
I personally believe that servers should not have names, as naming them just creates the incentive to try to fix broken infra instead of just killing and respawning. I've considered using some kind of short UUID for hostnames instead.
Assuming you meant "short ID", how is that any different than the OP's suggestion of randomly choosing from a list of meaningless names?
Corollary: in a room of 23 randomly selected individuals, there's a 1/2 chance that two individuals share a birthday. 365 days in a year, but only 23 samples gets you to 1/2 chance of collision. Look up the "birthday paradox" for more info on this surprising result.
For a 50% chance of collision you need Math.pow(2,32/2) calls to your ID generating function which is 65536 calls.
And that's just a 50% chance. Even a 1% chance is too much (for which you would need much less than 65536 calls).
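For reference, the usual birthday-bound approximation, with N the size of the ID space and p the desired collision probability:

    n \approx \sqrt{2N \ln\left(\frac{1}{1-p}\right)}

With N = 2^32 that gives n ≈ 77,000 for p = 0.5 and n ≈ 9,300 for p = 0.01; the 65536 figure above is the common sqrt(N) shorthand, slightly below the true 50% point.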
I doubt "IBM, Google or Microsoft" are using 32bit UUIDs.
"4 billion IDs is likely Unique within an organization"
I don't think it's unique within any organization.
a1, a2, ... aN for servers with 2 cores, 8GB RAM
b1, b2, ... bN for servers with 4 cores, 8GB RAM
c1, c2, ... cN for servers with 4 cores, 16GB RAM
"Sure Frank. What's its host tag name?"
If the hostname is randomized, then the logical thing would be to copy and paste the hostname and send it off over the wire, which would theoretically cut down on human error.
It's annoying enough to try to find "snowwhite" amongst 40 servers in each of 40 racks; trying to scan the list of names for "ABC123XYZ987", when the name correlates with no useful information at all, would be absolutely maddening. Jesus christ, I can just imagine, for every server, looking at the name on a sticky note, looking at a tag name, and then looking again at the sticky note and back at the tag name, possibly a third time, just to make sure it wasn't off by one number or letter or something. Repeat for thousands of servers? Euuggh!
Besides the poor cage monkey having to deal with that crap, when you're fighting fires in the ops room you need to be able to refer to a handful of server names quickly. You can't be copying-and-pasting names to people, looking information up every 2 seconds because you can't possibly memorize it. We are human beings and so we need human interfaces to information.
I've done this, and it works very, very well with third parties. You simply give them their work-order, and it consists of working on particular physical devices.
Row 10, Rack 11, RU 15 - Serial Number 103527382 - Replace Hard Drive.
Row 13, Rack 9, RU 6 - Serial Number 103528942 - Replace Power Supply.
The work orders are generated from your CMDB, which tracks things like serial numbers and physical locations.
As organizations grow, this is eventually where they all end up (after multiple iterations of other, less scalable systems).
The human interface problem is solved by physical directions, not by the type of data provided.
In the above case, the question was how to direct technicians to gear - and, specifying a physical location and a faceplate label (Serial Number or what have you) does the job.
In the case of Humans - they usually have cnames for the function they are interested in anyways.
I do work on about 4500 servers - and I couldn't tell you the name of a single one of them (though I note they have some long, convoluted DNS PTRs, even harder to understand than a serial number). But I have a ton of cnames for when I want to log in to a particular customer's server (based on customer, production/test/dev/fste, function).
Teams that need to do maintenance on servers have tools that group them based on role, location, data center, etc... They never actually "ssh" into a server the way I might.
A lot depends on whether you are talking about scales of 100+, 1,000+, or 10,000+ servers (or network devices, for that matter). At each stage you start to lose more and more of the human naming convention, and move everything into a Configuration Management Database (CMDB).
dig +short 4test.shephard.org
That was part of the reason for that curated list of words that they were recommending in the article.
> If the hostname is randomized, then the logical thing would be to copy and paste the hostname and send it off over the wire
Eventually that chain of communication will need to end in an action. Not all actions are just "copy-paste it into a box." What if I know that the server is in Rack #3, but now I need to look at labels on the actual servers to find server "SDFssdfa4324tdfgfg"? It's a lot easier to grok server "crimson".
Bob: "Frank? Which server did you say was on fire?"
Frank: "It's the email server. The name is ..."
Bob: "Frank - you know the rule. Send it, don't say it."
Frank: "Uh ... I can ... maybe text it to you?"
Frank: "Uh ... pigeon?"
Bob: "The intern ate the last one."
Frank: "Uh, carve it on the intern's back?"
Bob: "Tattoo it. Safer that way."
Frank: "OK, give me an hour."
Bob: "Hey, server's burning, you know. Ain't gonna last forever ..."
Keeping your hostnames secret, like dropping ICMP, is useless.
However, I personally question the wisdom of giving a network intruder a roadmap to your internal network via easily predicted DNS names, but that's probably just me being paranoid.
I used five-part DNS CNAMEs exactly as they're specified in the article (a bit shocking, actually), except that mine swap the "group name" (prod, staging, CI, etc.) and the geography, with the rationale that "groups" can span several geographic areas (example: a multi-regional production environment), so they're logically the "containers" of the geographic regions.
Example: rtr01.us-west-2.prod.crittercism.com, NOT rtr01.prod.us-west-2.crittercism.com
Do you manage 12 machines in a closet, 200 EC2 instances, or 1,500 systems in 5 datacenters?
Do you have more than one datacenter in a given city? in a given campus? in a given building?
Do you build systems that serve a single purpose such as a database, or do you run multiple daemons on a single system? How do you take docker/containerization into account?
How often are your sysadmins logging into these systems? Once a week (spell out "production"), or 20 times a day (abbreviate it as "prod")?
Are your systems all owned and operated by you? Do you manage systems for multiple clients in the same datacenter?
Does your team consist entirely of your college roommates you started your company with? Or are you a team of 20 spread out across different time zones and different countries? Will your funny/memorable names be offensive?
This all seems like an exercise in trying to build The Server Naming Tower of Babel Bikeshed.
Report, using the "command" module, free memory on all "nyc-web" hosts in our "prod-hosts" inventory:

    ansible -i prod-hosts nyc-web -m command -a 'free -m'

Given a trivial playbook, web.yml, that targets the "web" group:

    - hosts: web

Run it against every web host:

    ansible-playbook -i prod-hosts web.yml

Limit the run to the NYC machines:

    ansible-playbook -i prod-hosts web.yml --limit nyc

Or to the intersection of the "nyc" group and a single address:

    ansible-playbook -i prod-hosts web.yml --limit 'nyc:&10.0.2.21'
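For context, here's a hypothetical prod-hosts inventory that would make the patterns above work (all names invented):

    [nyc-web]
    nyc-web01.example.com
    10.0.2.21

    [sfo-web]
    sfo-web01.example.com

    [web:children]
    nyc-web
    sfo-web

    [nyc:children]
    nyc-web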
I use Pokemon. What do you use?
Maybe find molecules with the corresponding molar mass, e.g. glucose = C6H12O6 = 180?
Machete, Vega, Jack, Oscar, Angel, Tattoo, Tigre, Carlos, Vic, Reynaldo, Guerrero, etc...
On the other hand, choosing a name for what is running on that server makes it just one step easier for an attacker?
If attackers ARE in your DC already, you're already hosed, and the few minutes that it would take them to determine that some obscure name like "host-a831f1" is the DB won't matter.
So in general, I believe optimizing for maintainability here (easier names) is more worth it than falsely believing that obscuring the names provides some level of security.
Though, if all your servers are able to be accessed from the public internet, it might be a different story. But that really isn't recommended.
That's a defeatist attitude, and the reason why security companies get away with only selling perimeter defense products. "Well if they get in they can do whatever they want anyways." If servers are properly insulated from one another, violating a single server won't give them complete access to your infrastructure.
An example of this: Valve was infiltrated by a hacker who managed to exploit an ASP server for a random webpage, and was able to get all the way to Valve's Perforce servers and steal a copy of the source tree for Half-Life 2. There's no reason in hell a random web-exposed server should be on the same network as their Perforce server, but that's what having poor internal network security does for you.
Knowing the name "main.prd.example.com" doesn't help if it's got a bastion host, thorough firewall rules, key-only SSH login, et cetera.
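Key-only SSH login, for instance, is a three-line change; a minimal /etc/ssh/sshd_config fragment (restart sshd after editing):

    PasswordAuthentication no
    ChallengeResponseAuthentication no
    PermitRootLogin no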
Yes, I am in total agreement with you. Much appreciated.
Name your machines XXYYN[N] where XX is the building name, YY is the rack name and NN is the position in the rack? You'll eventually have more than 676 buildings, 676 racks, or 99 machines in a rack. Multiple people will have written regexes to split that into building, rack, position and they will break when you grow one of the fields.
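For example, the kind of one-liner that ends up baked into scripts (hypothetical scheme: two letters of building, two of rack, two digits of position); the hard-coded field widths fail the day any field grows:

    echo "abcd42" | sed -E 's/^([a-z]{2})([a-z]{2})([0-9]{2})$/building=\1 rack=\2 position=\3/'
    # building=ab rack=cd position=42 -- but "abcd107" passes through unparsed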
And I'd argue that building numbers probably don't want to be part of your scheme.
For the beefy host machines, I just go with core-a, core-b, core-c etc plus a geo tag. So: 'core-a.de.domain.local', 'core-b.de.domain.local' etc.
For the vms, they get functional names like 'n0-dbc1.de.domain.local' (for node0, database cluster 1) or 'git.de.domain.local' etc.
Only the last two digits of each guest VM's MAC address (or addresses, if it has multiple eth interfaces) change. The rest is tied to the underlying host machine. So, e.g., 00:50:56:01:01:06 means a MAC for a guest on 'core-a', 00:50:56:02:02:03 means a MAC for a guest on 'core-b', etc.
All internal non-publicly routable IPs use our domain and the '.local' TLD (with bind serving our internal network). All external, publicly routable IPs use our domain and the .net TLD. Finally, our frontend website IPs use our domain and the .com TLD.
The above won't scale well beyond one or two hundred VMs per data centre location, but if we ever need more than that, it will be a nice problem to have :)
It helps that we don't filter on MAC. The main reason for tying the addresses to the underlying core is to avoid accidentally allocating the same MAC to two different VMs (since at the moment VM creation is still a relatively manual process for us - hoping to improve this in the future though!).
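A minimal sketch of that allocation, assuming a plain-text ledger of MACs already handed out (in practice you'd query the hypervisor):

    #!/bin/bash
    # Next free guest MAC on core-a, whose fixed prefix is 00:50:56:01:01
    # (no overflow check past :ff in this sketch)
    prefix="00:50:56:01:01"
    last=$(grep "^$prefix" allocated-macs.txt | cut -d: -f6 | sort | tail -n 1)
    printf '%s:%02x\n' "$prefix" $(( 16#${last:-00} + 1 ))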
The words you speak should be the words you use, not abbreviations of them. You don't say prd, you say prod, you don't say tst, you say test, so prod and test are what you should use. One ubiquitous language everyone can share without the burden of useless jargon. Dropping out vowels isn't saving you anything.
Except the consistency thing I was talking about. At a previous job at a Fortune 100 company, we had to have a consistent naming scheme for everything in the DC, for whatever reason that predated me: a 3-letter production environment, followed by a 4-digit number which described all kinds of things about what apps/environments ran on the host.
Also, having fixed-width fields in hostnames is very valuable when you are writing scripts to parse data out of them; especially when you are scaling up to tens or hundreds of thousands of hosts, you begin to appreciate consistency.
It's a different story for small networks; have all the fun you want. But with hundreds of thousands of hosts scattered all over the world, that approach isn't used.
I'm just pointing out the fact that billion dollar real world orgs use abbreviated hostnames more often than not, and they have good reasons to do so.
You can have short server names that are easy to type AND still use actual words. Typing prd instead of prod is not saving you anything worth saving nor making server names easier to type or remember no matter how many there are.
> and they have good reasons to do so.
I usually suggest that people read RFC 1178, Choosing a Name for Your Computer: http://tools.ietf.org/html/rfc1178
At least then they will avoid the usual pitfalls worked out over the many years in which networked computers have existed.
In a modern world where the machines might be homogeneous and VMs/containers above define the actual role, the machine might just be "host123" whereas the higher level services have specific names like "db001."
With service cataloging and discovery tools like Consul (disclaimer: I wrote it), there is an easy way to see a mapping of service name back down to the host it is on. So even if you're yelling to an ops person "hey db001 is having problems," the ops person can quickly map db001 down to host 319.
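Concretely, with a local Consul agent that mapping is one DNS query away ("db" is a stand-in service name; 8600 is Consul's default DNS port):

    dig @127.0.0.1 -p 8600 db.service.consul +short        # node addresses
    dig @127.0.0.1 -p 8600 db.service.consul SRV +short    # ports and node names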
And with slightly more complex (but worth it, imo) naming, you can determine the rack, datacenter, etc. of a server just by the name.
Especially given that the name changes at every startup. It's absolutely not recommended to stick with the auto-assigned names; it's a rather good idea to name the node after the machine or the location instead.
Naming is very important and often done poorly. The positional scheme you link to works very well.
You need to label things, and bear in mind that you might be asking some outsourced third-party "pair of hands" who has no knowledge of your environment to reboot a box - you want to be sure it's the right one!
Active equipment like servers often has naming conventions around function etc., but consider what happens if someone remotely renames a server and its physical label is not updated...
For passive equipment, e.g. patch panels, we recommended a suffix indicating U position (being consistent about whether Us count from the bottom, and which edge defines the U position). Most places start with U1 as the bottom U position, and equipment that takes up Us 1 & 2 is defined as being in U1 (bottom edge). Alternatives are OK, but be clear and consistent.
Many organisations are not definitive on location IDs - down to the city, the name of the building within the city, even the names of rooms within a building - multiple names are often used!
Examples might be:
(omitting the U suffix is often fine for such kit)
Lots of fun in this area!
One of the many things I like about my ISP is that they apparently feel the same:
traceroute to 220.127.116.11 (18.104.22.168), 30 hops max, 60 byte packets
1 X.X.X.X (X.X.X.X) 7.057 ms 8.826 ms 8.859 ms
2 c.gormless.thn.aa.net.uk (22.214.171.124) 26.691 ms 28.136 ms 27.243 ms
3 c.aimless.thn.aa.net.uk (126.96.36.199) 29.673 ms 30.793 ms 30.273 ms
example.com, *.example.com, *.*.example.com, *.*.*.example.com [...]
Having a "mail server" might be fine for lots of places, but having disparate machines with specific functions is not invalid.
Essentially, the hostname should not have any indication of the host's purpose or function, but instead acts as a permanent, unique identifier to reference a particular piece of hardware throughout its lifecycle.
It follows that if you are using virtual machines and configuration management, where the lifecycle of a VM or amazon instance is trivial compared to an actual piece of hardware, there is less benefit to separating hostname from functional name.
So, for example: sea1r001u01. Our team in the data centers provides the initial name based on where they place the server. The unit value starts at 01 at the top of the rack and increases sequentially.
This means that if we have a hardware issue, the data center teams only need the hostname to know exactly where the server is located and get to it.
Firstly, all VMs are named after their function (AD01, named01, etc.); we never re-purpose VMs.
Workstations went through a few conventions dependent on size:
o French revolutionaries
o Generals from WW2
o Fellows of the Royal Society
o Towns of the UK
My favorite is Fellows of the RS, as a lot of them have Wikipedia pages.
The key thing is that you don't recycle names. When a machine dies, so does the hostname.
1) Join the vowel generation. If your tools cannot handle unlimited length names, use different tools.
2) Use descriptive names. Don't name them after your pets.
3) Rigid rules like Hungarian Notation produce cryptic and irrelevant noise.
Why? Because the functions served by this machine may change over time -- so any descriptive name you use may become confusingly inaccurate.
But you need a way to refer to the _machine_ itself, whose functions may change over time. The best way to do this is to pick a _non-descriptive_ unique identifier that will always refer to that machine. As the OP mentions, that non-descriptive unique identifier "will mostly be useful to operations engineers, remote hands, and for record keeping."
Then you make a descriptive name as a CNAME. They don't mention it, but what that CNAME points to might change over time, etc.
I have found this general principle to be a very good one even in my much smaller shop. When we used descriptive names as the 'main' canonical machine name, this led to huge confusion when roles changed -- you either had descriptive names that were no longer accurate or confusing, or you changed them but still had documentation or notes or tickets referring to the old name, etc.
The rest of the OP goes into a suggested scheme for making the descriptive names (the CNAMEs), in a way that will actually _be_ descriptive of what developers and ops will need to know. Their scheme seems pretty decent to me, but definitely depends on the particular context and domain of the shop, including how many machines you have, whether they are geographically dispersed, etc.
But the basic concept of using a _non-descriptive_ name as the basic machine-name A record, with descriptive names being CNAMEs to it, is I think pretty widely applicable and wise. (And that mnemonic project list is pretty useful for creating non-descriptive unique identifiers that are still easy to remember and record.)
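In BIND zone-file terms the pattern looks something like this (the names are invented, in the mnemonic style of the article):

    ; the permanent, non-descriptive machine name:
    bashful        IN  A      10.1.2.3
    ; descriptive role names layered on top, free to move later:
    db01.prod      IN  CNAME  bashful
    mail           IN  CNAME  bashful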
Also, it's useful to put the DC and rack number in subdomains after the hostname (hpbl7x0001.r55.la.example.com). The next time you're running around the cage reading every single 1U's tag name going "AARRRGHFHH WHERE THE FUCK IS SLIMSHADY.EXAMPLE.COM??", you'll thank me.
(i have no idea why, but it's much easier to get a dns admin to create a new A record than it is to get a NOC employee to update the network inventory database)
(also: good luck using your network inventory db to look up a rack location when the rack/switch that's hosting the network inventory db is down...)
> (i have no idea why, but it's much easier to get a dns admin to create a new A record than it is to get a NOC employee to update the network inventory database)
Well yeah, if parts of your organization are dysfunctional then you find ways to hack their functions into the more functional parts of your organization. But that doesn't make it the right approach in general.
apt-get install beep
This short form is "for historical reasons". That's not "they're too lazy to fix it" reasons; rather, there is a very large amount of hardware deployed elsewhere which you might have to interact with, and which makes assumptions. You have to fit within the lowest common denominator of all those assumptions.
> If your tools cannot handle unlimited length names, use different tools.
Because that's not possible. According to the RFC, FQDNs are limited to 255 octets, and individual labels (i.e., between the dots) are limited to 63 characters. Having to account for e.g. IDN means space limits are an active concern.
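If you want tooling to police that, the check is trivial; a sketch:

    name="app01.us-west-2.prod.example.com"
    [ ${#name} -le 255 ] || echo "FQDN over 255 octets"
    for label in $(printf '%s' "$name" | tr '.' ' '); do
      [ ${#label} -le 63 ] || echo "label '$label' over 63 octets"
    done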
You could generate one automatically during the setup of the server.
For clouds I use the cloud and zone/region:
(Why can't I escape * 's? ....)
fwl, ups, pdu, rtr, swt, etc... all really give away too much info imho.
Yes, someone may discover what the device is on their own via nmap -O or something, but telling them up front "this is a PDU, and if you mess with it, it may crash an entire cabinet" is just... silly.
I tend to follow the parent comment's suggestion more, labeling by location and environment (prod, dev, etc).
Remember that the reverse DNS always resolves to something like orange.example.com, which gives away no information at all.
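Easy enough to verify from the outside (203.0.113.10 is a documentation address standing in for one of yours):

    dig -x 203.0.113.10 +short
    # orange.example.com.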
We ended up chatting -- but for further clarification for anyone else -- yes, you can have additional IPs if you can justify them. There are practical limits, but we haven't come across a situation where we've needed to enforce a hard limit at this point.
Given that in some cases you might be typing them very frequently, some people would want to shorten them. I assume it's more to do with typing speed than saving bytes somewhere.
Another good thing to do, I was told: according to RFC 1413, always make sure you are running the identd service; it's a very helpful protocol.
I just don't get why people are so paranoid on the internet. I've been a sysadmin since the '90s and never once had a virus or been hacked.
I guess with everything now being a "devops" world, we should just focus entirely on convenience and forget about the old-timers' annoying speeches on "SPOF" and "Security/Risk Management" that block me from pushing my awesome new code to production as quickly as possible. We use Chef and CI, so I don't ever have to even think about the server itself, other than that it is running Ubuntu, which is super fast and a very secure OS.
I love when new software comes out, I just grab the recipe and ship it! Even if it is geared to server farms with hundreds of machines, I probably need to use it for my 5 server cluster.
Anyway, sorry to get off topic.
As to the security benefits of configuration management tools, again, speaking as an infosec guy, I love them. I can push a locked-down, default-deny base config out to all of the computers under my care, and I don't have to worry about making a mistake or missing a step or forgetting to document something. I can work with the devops team to set up automated functional and security testing using a continuous integration tool, so that config changes (including security updates!) get vetted automatically in a development or test environment before being pushed to production. I can put the whole config under revision control, so if there is a service or security incident due to a config change, we can figure out how our development and testing processes failed us - and how we can improve them.
And finally, speaking as an ardent FreeBSD user, I wholly agree that Ubuntu is utter rubbish.
I will grant you that it may come down to differences in our threat models. In my case attacks are impersonal - it's the malware or botnet /du jour/ that I have to deal with on a regular basis, versus the kind of APT facing journalists, civil rights organizations, or militaries. Even then, I'm not sure that naming conventions which leak less information than a basic port scan will slow an APT down. Then again, the administrative overhead caused by forcing sysadmins to constantly go to a CMDB just to do basic troubleshooting might be a cost that targeted organizations are willing to pay. In my case, we run really, really lean, so we do what we can to make our I.T. services self-documenting (naming and numbering conventions that have meaning across multiple network layers).