
GitHub's Metal Cloud - samlambert
http://githubengineering.com/githubs-metal-cloud/
======
alpb
> We've hacked together a Ruby script that retrieves a console screenshot via
> IPMI and checks the color in the image to determine if we've hit a failure
> or not.

That's pretty funny, yet it sounds familiar to many of us: every now and then
we all resort to these sorts of nasty hacks.
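
For illustration, a minimal Python sketch of that trick (the article's actual
tool is a Ruby script; the failure colour, threshold, and file path below are
made up, and it assumes the screenshot has already been pulled from the BMC
and that Pillow is installed):

    # Decide pass/fail from the dominant colour of a console screenshot.
    # Assumes the screenshot was already fetched via IPMI and saved locally.
    from collections import Counter
    from PIL import Image

    FAILURE_COLOURS = {(255, 0, 0)}   # hypothetical "error screen" colour

    def looks_like_failure(path, threshold=0.30):
        img = Image.open(path).convert("RGB")
        pixels = list(img.getdata())
        counts = Counter(pixels)
        failure_pixels = max((counts[c] for c in FAILURE_COLOURS), default=0)
        return failure_pixels / len(pixels) >= threshold

    if __name__ == "__main__":
        print(looks_like_failure("console-screenshot.png"))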

~~~
SEJeff
Vs. just buying Opengear console servers and something like conserver to get
the actual text via serial, like most large Unix environments (every job I've
worked at) do. Then you can just scrape the text.

~~~
FireBeyond
Right, this surprised me. They're doing IPMI, so why not SOL (Serial over LAN)
to get the raw stream?
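
When SOL does cooperate, scraping the stream is not much code. A rough Python
sketch using pexpect (host, credentials, and the prompt to wait for are
placeholders; assumes ipmitool and pexpect are installed):

    # Open a Serial-over-LAN session with ipmitool and scrape the console text.
    import pexpect

    child = pexpect.spawn(
        "ipmitool -I lanplus -H 10.0.0.50 -U admin -P secret sol activate",
        timeout=300,
    )
    try:
        # Wait for a login prompt (or any other marker you care about).
        child.expect("login:")
        print("Console reached a login prompt; output so far:")
        print(child.before.decode(errors="replace"))
    except pexpect.TIMEOUT:
        print("No login prompt seen; possible boot failure.")
    finally:
        child.close()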

Actually, on second thought, I'm not surprised so much. I've read many
threads on Supermicro IPMI and people's frustration with it (reliance on
outdated Java, and hacked-together wrappers over VNC) that make it seem like a
deliberate choice to obfuscate things -just- enough to make other tools
difficult to build.

~~~
SEJeff
No; a career spent building those types of tools taught me that SoL is garbage
from most vendors. Cray, Dell, and HP are (arguably) the best, with mostly
reliable SoL, but they're still awful. If you paste too big a buffer into a
SoL session, the Dell DRAC will freeze, so you have to kill and restart the
serial connection. If you have > 1000 machines, hardware serial is the best
thing to do for management, in addition to IPMI for power management.

------
mverwijs
We did something similar at Optiver.

    
    
        * boot a custom live CD (a la Knoppix) over PXE
        * live CD places the node into a database if it doesn't exist yet,
          using dmidecode to find serial numbers and such
        * live CD keeps querying the database for instructions
        * engineer adds a profile to the node in the database
        * live CD slices up the disk to match the profile
        * live CD fetches a tarball of the base OS from a URL
          and throws it on the metal. Runs grub setup. Reboots.
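
A rough sketch of what the live-CD side of that loop might look like (the API
URL and payload shape are hypothetical; only dmidecode and the Python standard
library are assumed):

    # Register the node by serial number, then poll a provisioning API
    # until an engineer has attached a profile to it.
    import json
    import subprocess
    import time
    import urllib.request

    API = "http://provision.example.com/api/nodes"   # placeholder URL

    def serial_number():
        out = subprocess.run(["dmidecode", "-s", "system-serial-number"],
                             capture_output=True, text=True, check=True)
        return out.stdout.strip()

    def post_json(url, payload):
        req = urllib.request.Request(url, data=json.dumps(payload).encode(),
                                     headers={"Content-Type": "application/json"})
        return json.load(urllib.request.urlopen(req))

    serial = serial_number()
    post_json(API, {"serial": serial})        # create the node if it doesn't exist

    while True:                               # wait for an engineer to add a profile
        node = json.load(urllib.request.urlopen(API + "/" + serial))
        if node.get("profile"):
            break
        time.sleep(30)

    # ...then partition disks to match node["profile"], unpack the base OS
    # tarball, run grub setup, and reboot.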
    

Tumblr did something like this with
[http://tumblr.github.io/collins/](http://tumblr.github.io/collins/).

~~~
voltagex_
Hey cool so what I'm fiddling around with in iPXE isn't completely obsolete in
the age of Docker and disposable VMs.

~~~
mverwijs
Well, no. But IMHO, iron is only doable if you have enough of it. I learned
this the hard way at my previous contract, where we had only a handful of
servers, all of them in production.

When you want to automate your complete infra, including rolling out hardware,
you need hardware to develop and test on. The entropy of life will ensure that
at exactly the moment you need to reinstall that PostgreSQL slave from
scratch, the PXE server is unreachable, or the server has a different disk
controller, or an iLO certificate has expired. Or something stupid.

Test your code. And for Ops this means: machines that are solely for the
testing pleasure of Ops. No other function.

~~~
voltagex_
In a perfect world you'd have a Dev, Staging and Prod PXE server, but reality
says you're not going to be able to get signoff to run that many.

~~~
mverwijs
But that is perfectly fine! That only means that you cannot have an automated
install procedure that the company can rely on. There is really nothing wrong
with a little manual labour at this stage. Do not spend weeks and weeks on
automating this without having the environment to test your code.

------
seanhandley
Sounds a lot like [http://theforeman.org/](http://theforeman.org/)

------
ymse
Hardware provisioning is a dying art. I would love to see a modern-day xCAT[0]
clone that's easy to install and configure and with proper multi-platform
support. Foreman is half-way there, but AFAIK doesn't do BMC provisioning and
discovery, which is a big deal.

0: [https://github.com/xcat2/xcat-core/blob/master/docs/source/i...](https://github.com/xcat2/xcat-core/blob/master/docs/source/index.rst)

~~~
eLobato
Foreman does discovery [1] (PXE, PXE-less, segmented networks, through
bootdisk) and handles BMCs; I wrote the API, in fact :)

[http://theforeman.org/plugins/foreman_discovery/4.1/index.ht...](http://theforeman.org/plugins/foreman_discovery/4.1/index.html)

Maybe you were not aware of it because it's a plugin; we kind of have that
problem in the Foreman community, where plugins are not as visible as they
should be even though they can contain key features.

------
danesparza
I've known about Hubot for a while.

But did anybody else see Hubot with a Santa hat and think that was adorable?
Because I did.

------
ptype
Why would a company like GitHub choose Ubuntu over Debian? The LTS policy?

~~~
wtbob
I too would be interested in the answer. From my perspective, Debian is the
server Linux distro _par excellence_, and in my experience the folks who
choose Ubuntu have been devs who don't actually use Linux (e.g., the sorts who
develop on a Mac or in a VM rather than on a personal Linux system). It's not
really fair to Ubuntu, which is decent enough in its own way, but I tend to
consider the choice of Ubuntu to be a bit of an architecture smell.

I'm honestly interested in what the valid reasons to prefer Ubuntu over Debian
(particularly on the server side) are.

~~~
q3k
Newer packages, sane LTS policy, easier to get non-free firmware/drivers going
(as in, the default CD comes with them), seemingly more support from third
parties.

They're both pretty awful due to their automagic(al tendency to break down in
mysterious ways), but if I have to choose, I'll go with Ubuntu.

Disclaimer: My main box runs Gentoo and I own no Mac machines, if that changes
anything in your vision of Ubuntu users.

~~~
zeveb
> Newer packages, sane LTS policy

Those two are in opposition: Debian (generally) has new-enough packages, but
it's _stable_, which is what one wants on a server system. Meanwhile,
Debian's LTS story is better than Ubuntu's: just upgrade, and know it will
work.

> easier to get non-free firmware/drivers

But how often is that needed for server systems? And of course, there're the
ethical & engineering issues with using proprietary software in the first
place.

> seemingly more support from third parties

There is that, but if we all wanted more support from third parties, we'd have
stuck with Windows, no?

> They're both pretty awful due to their automagic(al tendency to break down
> in mysterious ways)

I've not experienced that with Debian in a long time. I used to have issues
with Ubuntu, but I don't think that they were generally all that bad. Better
than what I used to experience with Macs and Windows back in the 90s, anyway.

~~~
FireBeyond
"stable". "new-enough"

This is a leading word. As is a lot of that paragraph. For many tools some
companies use, Debian certainly is NOT "new-enough" with many package choices.
Nor is Ubuntu inherently NOT "stable" \- and still trails a little behind the
leading edge. As for upgrades, I've watched many a server upgrade seamlessly
from 10.04 LTS to 12.04 to 14.04. I'm sure there can be and has been many a
person, many a thread who've not had seamless experiences. But the same
applies for Debian - heck, even the release manual has a section entitled "How
To Recover A Broken System" with reference to system upgrades.

"non-free firmware/drivers"

How often is it needed? In this article alone: IPMI, BMC, RAID, BIOS.

"And of course, there're the ethical & engineering issues"

This is a derailment. What exactly are the ethical issues for a closed source
company in using other proprietary software?

I'm by no means an Ubuntu fanatic. It has its share of issues, absolutely. I
have everything from FreeBSD to Debian to RHEL to OmniOS to administer, and
they all have strengths and weaknesses.

------
castell
The Hubot workflow sounds interesting. It seems more and more DevOps teams
prefer it.

Does anyone have first-hand experience with this kind of Hubot usage? Do you
prefer such commands, or would you rather write more informal short sentences?

~~~
ktt
Unfortunately, most of the material on ChatOps currently covers only how to
get Hubot to display cat pictures or other trivia [1]. Maybe it's because each
company has to create its own "chat API", but I'd also like to hear some
real, inside "war stories".

Does anyone know what app GitHub uses for chat? It looks like a simple and
elegant UI over Basecamp.

[1]: [http://hubot-script-catalog.herokuapp.com/](http://hubot-script-catalog.herokuapp.com/)

~~~
tehbeard
Going off Hubot's source code, I'd guess Campfire.

------
stephenr
These lines seem odd to me, maybe it's just the wording:

> [gPanel] Deploying DNS via Heaven...
> hubot is deploying dns/master (deadbeef) to production.
> hubot's production deployment of dns/master (deadbeef) is done! (6s)

Is this just (IMO) an odd use of the word "deploying", or does a DNS change
really mean building and deploying a new package/image?

~~~
_yy
They probably manage DNS in a Git repository/using Puppet, so deploying may be
quite literal. I see no issue with that.

~~~
mrmondo
We do this; it works well for us with 200-300 servers.

------
amalag
Is this also done remotely by gPanel?

> Once we've gathered all the information we need about the machine, we enter
> configuring, where we assign a static IP address to the IPMI interface and
> tweak our BIOS settings. From there we move to firmware_upgrade where we
> update FCB, BMC, BIOS, RAID, and any other firmware we'd like to manage on
> the system.

~~~
FireBeyond
In theory it should be, if you have a tightly controlled hardware process (and
in this case it's Dell, who are used to selling servers configured to PXE boot
initially, etc.) and you have some 'expect/send' scripting in place.

~~~
amalag
I was not aware you could control BIOS settings, BIOS upgrades, and IPMI
configuration remotely like that.

EDIT - looks like it is straightforward if you control the IPMI locally. So
the software would send commands to do it locally.
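
For the IPMI piece, plain ipmitool run from the host OS is enough once the
kernel's IPMI drivers are loaded. A hedged sketch (channel number and
addresses are placeholders; the BIOS and firmware steps are vendor-specific
and not shown):

    # Point the BMC's LAN interface at a static address using local ipmitool calls.
    import subprocess

    LAN_CHANNEL = "1"                         # placeholder channel number
    for args in [
        ["lan", "set", LAN_CHANNEL, "ipsrc", "static"],
        ["lan", "set", LAN_CHANNEL, "ipaddr", "10.1.2.3"],
        ["lan", "set", LAN_CHANNEL, "netmask", "255.255.255.0"],
        ["lan", "set", LAN_CHANNEL, "defgw", "ipaddr", "10.1.2.1"],
    ]:
        subprocess.run(["ipmitool"] + args, check=True)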

------
dorfsmay
Not as detailed on the tooling, but it shows how much hardware they use;
Stack Overflow did a blog post on their datacentre move:

[http://blog.serverfault.com/2015/03/](http://blog.serverfault.com/2015/03/)

------
A_Beer_Clinked
Can anybody with experience using OpenStack Ironic [1] in this space comment
on the advantages of rolling your own like GitHub did?

[1]
[https://wiki.openstack.org/wiki/Ironic](https://wiki.openstack.org/wiki/Ironic)

~~~
dmourati
A consultant from Ubuntu called Ironic a "stillbirth" and said its name was
indicative of its fit with the rest of OpenStack. His team used MaaS, which I
gather serves much of the same purpose:

[https://maas.ubuntu.com/](https://maas.ubuntu.com/)

~~~
stephenr
Breaking news: a consultant for a company trash-talks an open-source
competitor.

------
grandalf
This is pretty cool, but is there really that much benefit to doing this? What
size of IT staff do the savings justify? Does that change if you use Amazon's
market-driven options like spot pricing and capacity-planning discounts?

~~~
scott_karana
The equivalent of an m4.10xlarge running 24/7, which would cost about
$1814/month (roughly $2.52/hour on demand), costs about $400-$500/month to
lease from Dell or HP.

There are other costs, like cooling, power, peering, networking gear,
colocation/building costs, having spare parts on hand, paying sysadmins, et
cetera, that are going to vary based on your requirements and region.

------
sargun
This seems wrong to me. This was the state of the art ~3 years ago. Now, I
feel like all of the machines should already be provisioned with an OS and a
basic image, and an orchestration system like CoreOS / Mesos / Docker should
specialize them.

IMHO, requiring dedicated hardware, or an entire machine, should be the
exception, not the rule.

~~~
mverwijs
I'm surprised everyone is still installing to disk. In 2008/2009, we had a POC
where we ran the OS from memory after booting over PXE.

    
    
       * boot a live image into memory
       * point LXD/RKT/Docker to /containers
       * ...
       * profit!

~~~
iofj
The big issue I have with that is that it involves trusting vendors to get
network boot right. It becomes a problem especially with the "loop until DHCP
gets a response" part: one of the cheap vendors tries 30 times and then goes
to a "boot failed" screen after trying the disk.

Also, about 1 time in 4-5000 or so, a network boot fails. Not sure why.

~~~
vidarh
If you have IPMI on the server this doesn't become such a big problem - you
can reasonably trigger resets/reboots if it's not up after a given amount of
time.
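
A minimal sketch of that watchdog idea (host, BMC address, credentials, and
timings are placeholders; assumes ipmitool is available):

    # If a node doesn't answer on port 22 within a deadline, power-cycle it via IPMI.
    import socket
    import subprocess
    import time

    HOST, BMC = "10.0.0.50", "10.0.100.50"    # node and its BMC (placeholders)
    DEADLINE = 15 * 60                        # seconds before giving up and cycling

    def is_up(host, port=22, timeout=3):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    start = time.time()
    while not is_up(HOST):
        if time.time() - start > DEADLINE:
            subprocess.run(["ipmitool", "-I", "lanplus", "-H", BMC,
                            "-U", "admin", "-P", "secret",
                            "chassis", "power", "cycle"], check=True)
            start = time.time()               # restart the clock after cycling
        time.sleep(30)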

~~~
iofj
We buy the cheapest server that meets our needs, and buy it in somewhat larger
quantities (often double what was originally envisioned, for less than was
originally budgeted). Much more efficient.

But it does mean no IPMI. However, I built a small circuit that sits on a
power cable and can interrupt it with a relay, controlled over a bus plugged
into one of our servers, so we can do the reboot thing.

I've been meaning to redo that power cable circuit using WiFi as the linking
technology, now that we have the ESP8266 available.
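
A MicroPython-flavoured sketch of what the WiFi version could look like (the
GPIO pin, SSID, password, and the bare TCP listener are all made up, and there
is no authentication, so it's only an illustration):

    # ESP8266: drop a relay for a few seconds whenever something connects to port 8080.
    import socket
    import time
    import network
    from machine import Pin

    relay = Pin(5, Pin.OUT, value=1)          # relay closed = server powered (placeholder pin)

    wlan = network.WLAN(network.STA_IF)
    wlan.active(True)
    wlan.connect("mgmt-ssid", "password")     # placeholder credentials
    while not wlan.isconnected():
        time.sleep(1)

    srv = socket.socket()
    srv.bind(("", 8080))
    srv.listen(1)
    while True:
        conn, _ = srv.accept()
        relay.value(0)                        # cut power to the server
        time.sleep(5)
        relay.value(1)                        # restore power
        conn.close()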

