
Homegrown DevOps Tools at Stack Exchange - KyleBrandt
http://blog.serverfault.com/2013/09/05/homegrown-devops-tools-at-stack-exchange/
======
WestCoastJustin
I think Stack Exchange has a secret weapon which will likely greatly improve
their backend systems. Tom Limoncelli, who used to work at Google as a Site
Reliability Engineer (SRE), now works at Stack Exchange [1]. He pretty much
wrote the bible for sysadmins, entitled "_The Practice of System and Network
Administration_" [2]. I wouldn't be surprised if we start seeing more posts
like this!

[1] [http://everythingsysadmin.com/2013/09/the-team-im-on-at-
stac...](http://everythingsysadmin.com/2013/09/the-team-im-on-at-stack-
exchan.html)

[2]
[http://www.amazon.com/dp/0321492668/tomontime-20](http://www.amazon.com/dp/0321492668/tomontime-20)

~~~
Fuxy
Wow, that's cool. I would love to work with those guys, though I wouldn't be
much help; I only have basic Linux administration skills.

~~~
gsands
Keep learning and soon you will be beyond basic.

If you have passion, you will have plenty to offer.

------
lifeisstillgood
This oddly sounds like a death knell for Windows. I am not seeing anything
here that is not already standard in the LAMP / OSS stack (Graphite, Nagios /
Munin).

If we rephrase the blog post as "we could not find any good tools in the
Windows devops space so we wrote them", and add to that the departure of the
only CEO willing to dance on stage chanting "developers, developers,
developers", then Windows looks less like an ecosystem and more like a hub
with a few brave outlying satellites.

I am impressed by the Stack Exchange folks and their story and skills, but it
feels like an amazing story of software skill written for one company and
never released in the open; it just leaves no legacy.

~~~
pnathan
In my experience, Windows tooling for serious admin work is either terrible or
massively expensive/enterprise.

For instance, a key problem I had in a prior gig was that I needed to
automatically log into a Windows machine, run a job, and then log out. Pretty
bog standard; didn't need careful error recovery or anything particularly
sophisticated.

In Linux, you configure your SSH keys, then run ssh automatedjob@server
"./run-my-thing", and that's that. I literally could not find any comparable
analog in Windows besides _telnet_ (if anyone knows of a solution here that
approximates the Linux one for simplicity, I'd love to know about it). Today
I'd probably just requisition a copy of a Windows SSH server and be done with
the sorry mess. Better yet, throw Windows out and go full Linux. ;-)

~~~
KyleBrandt
Windows has advanced in a lot of ways over the past several years. We hired
Steven Murawski a little while back (a PowerShell MVP), and he has been able
to automate just as much as you would expect in the Unix world.

He is also priming our infrastructure for Desired State Configuration
(configuration management for Windows, like Puppet/Chef).

~~~
pnathan
What I'm deriving from your statement there is, "So we hired a guru to bring
our Windows systems up to par with the Linux baseline."

I'm glad it's working for you. I use Stack Exchange daily and am generally
quite happy with it!

~~~
stevenjmurawski
Not at all. The Windows OS now has command line accessible management points
that are similar to Linux. There is still a great deal of difference in their
management models (I blogged about this a while back -
[http://blog.serverfault.com/2013/06/03/cross-platform-
config...](http://blog.serverfault.com/2013/06/03/cross-platform-
configuration-management-is-hard/))

I was hired as a Windows specialist so that we can go deeper on the OS side
and the PowerShell side, just as we have a Linux expert to go deeper on the
Linux side. Our sysadmin team was just tilted more towards experience on the
Linux side (though you wouldn't know it, as almost everyone I work with would
qualify as a senior admin in any Windows shop in the world).

~~~
pnathan
Steven,

Thank you for your response.

I am glad to see that there are comparable capabilities in the modern Windows
world and will dig into the WMI side of things next time Windows admin tasks
come up.

------
benjaminwootton
It's always nice to see new products in the DevOps space, but be careful not
to reinvent the wheel if you do this kind of thing, as the open source world
is coming on in leaps and bounds.

LogStash, ElasticSearch and Kibana are a great open source stack for log
management.

StatsD and Graphite are nice tools for metric tracking and visualization.

There are lots of open source dashboard offerings which combined with a bit of
scripting can get you far.

You are also spoiled for choice with SaaS monitoring offerings such as
NewRelic and Server Density, even if the OP isn't a fan of cloud-based tools.

~~~
KyleBrandt
I did an experiment with logstash, elasticsearch, and Kibana for our HAProxy
logs. The default with logstash was to store each field and then the whole
text, so things got quite large for our web logs. Also, the Kibana interface
was pretty buggy. Parsing our logs (~2k entries a second) doesn't work well
with regexes, so in our version we do a bunch of substring matching instead.
I'm excited about Kibana/ES for the rest of our logs, though, especially with
their recent hire.
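
Roughly, the substring approach amounts to something like this (a minimal
Python sketch, not our actual code; it assumes HAProxy's default HTTP log
format behind a syslog prefix):

    def parse_haproxy_line(line):
        # Illustrative only: field positions follow HAProxy's default HTTP
        # log format, so plain splits/finds are enough and no regex is needed.
        payload = line[line.index("]: ") + 3:]   # drop "... haproxy[pid]: "
        parts = payload.split(" ")

        client_ip, _, client_port = parts[0].partition(":")
        accept_date = parts[1].strip("[]")
        frontend = parts[2]
        backend, _, server = parts[3].partition("/")
        timings = parts[4].split("/")             # Tq/Tw/Tc/Tr/Tt
        status = int(parts[5])
        bytes_read = int(parts[6])

        # The HTTP request line is the last quoted chunk on the line.
        request = payload[payload.index('"') + 1:payload.rindex('"')]
        method, _, rest = request.partition(" ")
        url = rest.rsplit(" ", 1)[0]

        return {
            "client_ip": client_ip,
            "client_port": int(client_port),
            "accept_date": accept_date,
            "frontend": frontend,
            "backend": backend,
            "server": server,
            "time_to_response_ms": int(timings[3]),
            "total_time_ms": int(timings[4]),
            "status": status,
            "bytes": bytes_read,
            "method": method,
            "url": url,
        }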

When I looked at StatsD and Graphite last time I didn't really see an API. I
really like the model of the data being queryable and coming back in a nice
serialized format like json (as OpenTSDB does). I'm also not that fond of the
"many files" model or of the automatic summarization of data as it ages (it
does save space, but it makes forecasting difficult as it can skew the data).

~~~
zwily
We're parsing 2k lines/sec with logstash using regexes. We scaled it out to 6
logstash processes across 2 nodes. They pull off a shared redis queue and then
insert the results directly into elasticsearch. (That said, I'd like to
configure our apache logs to just output the json logstash expects.)
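
For anyone curious, the shape of that pipeline is conceptually just this (a
rough Python sketch with made-up queue and index names; in reality logstash's
redis input and elasticsearch output do this work for you):

    import re
    import redis
    import requests

    REDIS_QUEUE = "logstash"                 # hypothetical queue name
    ES_URL = "http://localhost:9200"

    # Simplified Apache access-log pattern (grok would normally handle this).
    LOG_RE = re.compile(
        r'(?P<client>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+|-)'
    )

    def run_worker():
        r = redis.Redis()
        while True:
            _, raw = r.blpop(REDIS_QUEUE)    # block until a shipper pushes a line
            m = LOG_RE.match(raw.decode("utf-8", errors="replace"))
            if not m:
                continue
            doc = m.groupdict()
            doc["status"] = int(doc["status"])
            doc["bytes"] = 0 if doc["bytes"] == "-" else int(doc["bytes"])
            # Index straight into Elasticsearch (URL path varies by ES version).
            requests.post(ES_URL + "/weblogs/_doc", json=doc)

    if __name__ == "__main__":
        run_worker()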

Graphite has a very simple and powerful JSON API [1]. Any graph URL can
include &format=json and you'll get back the raw JSON values for the
datapoints. I haven't used OpenTSDB yet - I'm curious how its API is better.
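
For example, something like this (the host and metric name are made up) pulls
the raw datapoints back out:

    import requests

    resp = requests.get(
        "http://graphite.example.com/render",
        params={
            "target": "servers.web01.loadavg.1min",
            "from": "-1h",
            "format": "json",
        },
    )
    resp.raise_for_status()

    for series in resp.json():
        print(series["target"])
        # Each datapoint is a [value, unix_timestamp] pair; value is None
        # where Graphite has no data for that interval.
        for value, timestamp in series["datapoints"]:
            if value is not None:
                print(timestamp, value)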

You get to choose the levels of summarization. If you want to keep 1 second
intervals for a year, you're welcome to do so.

[1]
[http://graphite.readthedocs.org/en/0.9.12/render_api.html](http://graphite.readthedocs.org/en/0.9.12/render_api.html)

------
stephengillie
That dashboard looks really neat. I've been searching for a good Windows
dashboard, and I like the patching views. Where's the download link?

~~~
KyleBrandt
None of this is open sourced yet. Nick is investing a lot of time to get what
we currently call "status" ready to open source.

Part of the reason for the open position linked to in the post (
[http://careers.stackoverflow.com/jobs/39983/developer-
site-r...](http://careers.stackoverflow.com/jobs/39983/developer-site-
reliability-team-stack-exchange) ) is to make it so we have more manpower to
get this stuff open sourced.

------
stevenjmurawski
jmelloy
([https://news.ycombinator.com/item?id=6334778](https://news.ycombinator.com/item?id=6334778))
had a really good point that wasn't covered in the post, since it was a post
about what we are doing, not necessarily why we are doing it.

One of our major problems with existing monitoring and management systems is
the lack of good APIs. We are a shop of developers and sysadmins who all
understand that real management systems need to be composable. The system
needs to accept that it won't solve every case out of the box and should
expose hooks into its management and functionality, allowing us to tie
disparate systems
together and enhance their coverage. I'd rather take a bunch of existing
products and put some cool dashboards on top, but most enterprise solutions
(and some of the open source ones) don't offer a decent API to work with.

~~~
dvanduzer
It's quite likely that I'm confusing "decent API" and "ease of extending and
integrating" and all that gets wrapped up with my long term familiarity with
Nagios.

Nagios check plugins don't have an API per se, but they have a very simple
standard for exit codes to be interpreted by the Orchestration component.
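
A complete check "plugin" can be as small as this (a toy example; the
thresholds and the load-average check are arbitrary):

    #!/usr/bin/env python3
    # Toy Nagios-style check: the exit code is the whole "API".
    # 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
    import os
    import sys

    WARN, CRIT = 4.0, 8.0

    try:
        load1, _, _ = os.getloadavg()
    except OSError:
        print("LOAD UNKNOWN - could not read load average")
        sys.exit(3)

    # Perfdata after the pipe is what the RRD-based trending add-ons consume.
    detail = "load1=%.2f | load1=%.2f;%s;%s" % (load1, load1, WARN, CRIT)

    if load1 >= CRIT:
        print("LOAD CRITICAL - " + detail)
        sys.exit(2)
    elif load1 >= WARN:
        print("LOAD WARNING - " + detail)
        sys.exit(1)
    print("LOAD OK - " + detail)
    sys.exit(0)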

Nagios reporting/alerting plugins primarily use RRD, so you always have _that_
API to do interesting things like trend analysis.

(One theory I have is that there is such a large and growing culture of
technologists who are thinking "integration tool" but only know how to _say_
"HTTP API".)

------
dredmorbius
OK, I'll be that guy.

This makes that page much easier to read. Especially the headings, which are
all mashed together for some reason...

    
    
        #content {
            margin-left: auto;
            margin-right: auto;
            float: none;
            width: 40em;
            font-size: 15pt;
            line-height: 1.4em;
        }
    
        h1 {
            font-size: 180%;
            line-height: 1.2em;
            margin-top: 2em;
            margin-bottom: 0.5em;
        }
    
        h2 {
            font-size: 160%;
            line-height: 1.2em;
            margin-top: 2em;
            margin-bottom: 0.5em;
        }
    
        #wrap {
            width: 100%;
        }

------
cllns
An interesting consulting niche would be to help companies open source
software they want to release.

Basically:

- clean up code,

- make sure the infrastructure is sufficient,

- help with marketing and adoption,

- write documentation

~~~
coolsunglasses
Very few companies want to pay for that, fewer still when in-house and third
party devs are often begging to do that work if they'd just let them license
the code as OSS.

------
pionar
I wonder how much investment they have in this vs. going with a pre-existing
monitoring system like ExtraHop?

~~~
KyleBrandt
I wasn't aware of ExtraHop, so I will have to look into that. We currently use
SolarWinds Orion, which is where status gets some of its data.

We have definitely outgrown Orion, and a lot of stuff in Orion is very rough,
sloppy, and not well integrated.

We don't really like the idea of cloud hosted monitoring (which is a lot of
what more modern monitoring systems are). Alternatives also seem very
expensive.

So if we are going to make an investment (cash or labor), I would rather we
get a system that fulfills all of our needs (fit) and share it with everyone.

~~~
berkay
"We don't really like the idea of cloud hosted monitoring" Can you elaborate
on the reasons?

~~~
DougWebb
I'll bet one of the reasons is the amount of data. At a previous employer I
built a system similar to this which produced nightly, weekly, monthly, and
annual reports of weblog analysis for an app that had 6 million+ http requests
per day. It was tough enough to consolidate those logs across the load-
balanced servers in a single data-center. We never consolidated across data-
centers (separate reports for each instead) and I doubt we could have shipped
all of that data to a cloud service in a reasonable amount of time.

BTW, we made heavy use of setting and logging HTTP headers too. One trick I
liked was capturing performance timing metrics as a request was processed and
stuffing them into a response header as the response went out. We then logged
the response headers, which gave us the ability to report on the performance
metrics. We also had a debug mode in the app on the browser side so we could
see the performance metrics from the headers there too.
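
In modern terms the trick looks roughly like this (a Python/WSGI sketch of the
idea only; the header name is made up, and the original system was Perl and
predates WSGI):

    import time

    def timing_header_middleware(app):
        # Wrap a WSGI app so the time spent producing the response headers is
        # stuffed into a response header, which the web server's access log
        # (configured to log response headers) then records.
        def wrapped(environ, start_response):
            start = time.monotonic()

            def timed_start_response(status, headers, exc_info=None):
                elapsed_ms = (time.monotonic() - start) * 1000.0
                headers = list(headers) + [("X-App-Timing-Ms", "%.1f" % elapsed_ms)]
                return start_response(status, headers, exc_info)

            return app(environ, timed_start_response)

        return wrapped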

~~~
dvanduzer
When was this?

~70 events per second (6 million+ requests a day works out to roughly 70 a
second) doesn't sound like much to capture and aggregate. How much of this
parsing did you need to perform in real time? Creating a unique token to pair
requests/responses shouldn't add much overhead at all.

~~~
DougWebb
Development started around 2000, and stabilized around 2008. As far as I know
the reporting scripts are still being run every day. During this period we had
purchased a 1TB storage rack from EMC for a million dollars, to give you some
perspective on the differences between then and now.

- No real-time parsing; it's all nightly batch processing after devops
rotates the Apache server logs to a storage volume. The logs sit there for a
while then get compressed and moved to offline tape archives.

- No DB storage of the logs; space was too expensive and the Oracle database
we had couldn't have kept up. It was already heavily burdened with a
completely separate usage statistics system that fed into user-facing
reporting and billing, which had a much higher event rate, about 100x higher,
than the http logs.

- We had unique tokens, but they identified a particular user session that
tied together all of the user's http requests from login to
logoff/abandonment, and which also tied into the Oracle-based statistics for
that user, that user's organization, and the customer responsible for the user
(often multi-organization). My reports had breakdowns for individual user
experiences, session-level metrics, and user
type/organization/customer/region/etc metrics.

- I don't recall how long the analysis took; it was between half an hour to
two hours I think. A lot of that time was spent on disk I/O reading the logs.
I had optimized the parsing, analysis, and results recording about as much as
I could.

- This stuff was written in Perl, and ran on Solaris servers from that time
era... probably not a lot more powerful than a handful of smartphones today,
though they did have lots of cpus. I don't think traffic has grown much since
I left the company (we had pretty full market penetration already) so it's
likely those servers haven't been upgraded.

~~~
dvanduzer
I suspected it would have to be a system of that era.

I think I have a good idea of how businesses (at a high level) have failed to
understand Moore's law from 2000-present. I'm curious what those failures of
understanding were like from 1985-2000.

We all know that technology has been advancing rapidly, but these specific
anecdotes of organizations paying a million dollars _just for the backing
storage_ of a system that you can essentially get for free from Google now...

~~~
DougWebb
Yeah, it's pretty amazing how much things have changed. That raw log data was
about 250GB/year which is nothing today but when we started collecting it we
were paying $1000/GB.

Actually, they're probably still paying over $100/GB. The whole datacenter was
outsourced to Perot Systems in the mid-2000s, and the storage fees were
astronomical. We calculated that Perot must pay a separate tech to stare at
each individual hard drive with a replacement in-hand in case any errors were
reported. At least, they could afford to do that with what we were paying them
for storage.

------
jmelloy
Where does the patching dashboard pull data from? Is it tracked by hand or is
there a scanner? We use Orion at work, and it's got a decent amount of data in
it, but is kind of kludgy and slow.

~~~
stevenjmurawski
The patching dashboard has scheduled jobs on the clients (PowerShell scripts
on the Windows boxes and Ruby scripts on the Linux boxes).

We use Puppet to deploy the client to our Linux boxes and for Windows we
deploy the task and scripts with Group Policy (soon to be replaced by Desired
State Configuration).

This information isn't something we need to poll for often (and on Windows,
there are difficulties interacting with the Windows Update APIs remotely). We
add and replace servers often, so the clients add themselves to the dashboard,
as well as updating their own status.
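
Conceptually the client side is not much more than this (a rough Python sketch
with a made-up endpoint and fields; the real clients are the Ruby and
PowerShell scripts mentioned above):

    #!/usr/bin/env python3
    # Sketch of a patch-status client that registers itself with the dashboard.
    # The dashboard URL and JSON fields are hypothetical.
    import json
    import socket
    import subprocess
    import urllib.request

    DASHBOARD_URL = "http://status.example.com/api/patching"  # hypothetical

    def pending_updates():
        # Debian/Ubuntu example; a real client would branch per platform.
        out = subprocess.run(
            ["apt-get", "-s", "dist-upgrade"], capture_output=True, text=True
        ).stdout
        return [l.split()[1] for l in out.splitlines() if l.startswith("Inst ")]

    def report():
        payload = {"hostname": socket.getfqdn(), "pending": pending_updates()}
        req = urllib.request.Request(
            DASHBOARD_URL,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        # The dashboard upserts on hostname, so new servers appear automatically.
        urllib.request.urlopen(req, timeout=30)

    if __name__ == "__main__":
        report()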

We have integrated some reporting into Orion, but that was a side effect of
not having the dashboard before.

------
jamra
I am curious how they deploy database schema and stored procedure updates.
That is a much harder problem.

~~~
JasonPunyon
We have a forwards-only migration runner that takes care of the deployment for
us. We don't use many stored procedures, but they're taken care of the same
way (in a migration). Our policy is that pushed code must be backwards
compatible by however many migrations are being deployed in a production
build.
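
The mechanics of a forwards-only runner are simple: apply, in order, every
migration that hasn't been applied yet, and never roll back. Something like
this (a sketch using SQLite as a stand-in, not our actual runner):

    import pathlib
    import sqlite3  # stand-in for the real database

    def run_migrations(db_path="app.db", migrations_dir="migrations"):
        conn = sqlite3.connect(db_path)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS applied_migrations ("
            "  name TEXT PRIMARY KEY,"
            "  applied_at TEXT DEFAULT CURRENT_TIMESTAMP)"
        )
        applied = {row[0] for row in conn.execute("SELECT name FROM applied_migrations")}

        # Migrations are numbered SQL files, e.g. 0007_add_flag_column.sql,
        # and always run in lexical (i.e. chronological) order.
        for path in sorted(pathlib.Path(migrations_dir).glob("*.sql")):
            if path.name in applied:
                continue
            conn.executescript(path.read_text())
            conn.execute(
                "INSERT INTO applied_migrations (name) VALUES (?)", (path.name,)
            )
            conn.commit()
            print("applied", path.name)

    if __name__ == "__main__":
        run_migrations()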

