
Rearchitecting GitHub Pages - samlambert
http://githubengineering.com/rearchitecting-github-pages/
======
kyledrake
Pretty similar to how Neocities serves static sites
([https://neocities.org](https://neocities.org)).

There are a few differences. We don't use SQL in the routing chain; we use a
regex to pick out the site name and then serve from a directory of the same
name (this is NOT as bad as it sounds, most filesystems handle this quite well
now and it takes MUCH more than half a million sites to hit a bottleneck).
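
To illustrate, here's a minimal Python sketch of that routing idea (not our
actual code; the domain and docroot layout are made up):

    import re

    # Pull the site name out of the Host header and serve from a
    # directory of the same name. Domain and paths are illustrative.
    SITE_RE = re.compile(r"^([a-z0-9-]+)\.neocities\.org$")

    def docroot_for(host):
        m = SITE_RE.match(host.lower())
        return "/sites/" + m.group(1) if m else None

    assert docroot_for("example.neocities.org") == "/sites/example"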

DRBD is also a little hardcore for my tastes. Nothing wrong with it, I just
don't know it well, and I don't like being dependent on things I don't know
how to debug.

An alternative I wanted to show uses inotify, rsync and ssh combined into a
simple replication daemon. It's obviously not as fast, but if you enable
persistent SSH connections, it's not too bad. If it screws up, you can just
run rsync. Rumor has it the Internet Archive uses an approach not too far away
from this for Petabox. Check it out if you're looking for something a little
more lightweight for real-time replication:
[https://code.google.com/p/lsyncd/](https://code.google.com/p/lsyncd/)
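
If you want a feel for what lsyncd automates, here's a toy version of the
inotify + rsync pattern in Python using the watchdog library (paths and the
replica host are made up, and real lsyncd batches events and syncs only the
changed paths, so use the real thing):

    import subprocess
    import time

    from watchdog.events import FileSystemEventHandler
    from watchdog.observers import Observer

    class Replicate(FileSystemEventHandler):
        def on_any_event(self, event):
            # Naive: re-sync the whole tree on any change. lsyncd
            # aggregates events and rsyncs only the affected paths.
            subprocess.run(["rsync", "-az", "--delete", "/sites/",
                            "replica.example.com:/sites/"], check=False)

    observer = Observer()
    observer.schedule(Replicate(), "/sites", recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    finally:
        observer.stop()
        observer.join()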

We're still working on open sourcing (!) our serving infrastructure, so
eventually you will be able to see all of the code we use for this (sans the
secrets and keys, of course). I've just been having trouble coming up with a
good solution for doing this. For now, enjoy the source of our web app:
[https://github.com/neocities/neocities](https://github.com/neocities/neocities)

~~~
charliesome
> _Nothing wrong with it, I just don't know it well, and I don't like being
> dependent on things I don't know how to debug._

Yep, this is basically our approach as well.

We've been using DRBD for quite a long time now on our Git fileservers (which
also run in active/standby pairs - in fact, they look a lot like our Pages
fileservers) so we have quite a lot of in-house experience with it and it's a
technology we're pretty comfortable with. Given this, using it for the new
Pages infrastructure was a pretty straightforward decision.

~~~
kyledrake
Yeah, I've read the engineering posts about DRBD for the git fileservers, so I
assumed that's why you made that decision. Makes total sense to me.

~~~
steveklabnik
This exchange is wonderful, and absolutely what I'd expect out of you two, and
maybe I'm just in a bad mood, but it stands in such contrast to the way I
often see technologies discussed online.

This kind of thing is the way engineering should be. Kudos.

~~~
patcon
Yay for acknowledgements! Yay for recursive enthusiasm!

------
jrochkind1
All of GitHub Pages was run off of _one_ server (with one failover standby)?

That's pretty amazing. If all you're serving is static assets, apparently you
have to grow to a pretty huge scale before one server is no longer sufficient.

I'm curious if there was at least a caching layer, so every request didn't hit
the SSD. They didn't mention it.

~~~
charliesome
We do have Fastly in front of *.github.io, but there's still a significant
amount of traffic (on the order of thousands of requests per second throughout
the day) that makes it through to our own infrastructure.

We don't do any other caching on our own, although the other replies are
correct in that the Linux kernel has its own filesystem cache which means not
all requests end up hitting the SSD.

~~~
fugyk
What percentage of the requests is handled by the CDN?

~~~
jsingleton
I believe that only CNAMEs use the CDN for custom domains (naked domains do
not) [0]. So that rules out a chunk but there will be lots of cache misses
too. Maybe someone from GH can confirm.

[0] [https://help.github.com/articles/about-custom-domains-for-
gi...](https://help.github.com/articles/about-custom-domains-for-github-pages-
sites/)

~~~
michaelmior
github.io will still also hit the CDN. It's just custom naked domains that
will miss.

------
adriancooney
I'm amazed that GitHub Pages ran on just two servers (well, aside from the
MySQL clusters). That is absolutely incredible given the sheer number of
projects and people who rely on it for their sites (me included!). I love the
philosophy behind Pages of abstaining from over-engineering and sticking to
the simple, proven solutions. It's a great service and I'm a massive fan.

~~~
joshmoz
It's cool that they were able to do this with two machines, and I don't want
to detract from that, but it's probably worth pointing out that a "machine" is
not a very useful unit of capacity. These two machines could be dual-core i5s
or they could be 20-core Xeon boxes with hugely varying amounts of memory and
storage. Too bad they don't clarify; I'm curious.

~~~
ceequof

      The fileserver tier consists of pairs of Dell R720s 
      running in active/standby configuration. Each pair is 
      largely similar to the single pair of machines that the 
      old Pages infrastructure ran on.
    

[http://www.dell.com/us/business/p/poweredge-r720/pd](http://www.dell.com/us/business/p/poweredge-r720/pd)

Two Xeon E5-2600s per machine.

~~~
ChiperSoft
The E5-2600 series comes in 4 to 12 cores, so that's still a pretty wide range
of compute power.

------
VeejayRampay
Well done GitHub. Also, a special mention to the invisible workers making
nginx such a cornerstone of modern infrastructure. It's a project that I don't
hear about often, probably because it's not the sexiest piece of technology,
but it really seems solid and battle-tested. Kudos.

------
nicolewhite
I've been using GitHub pages for a while now and I always wondered why they
had the "your site may not be available for another 30 minutes" message on
creating a new GitHub pages site while pushes to an already-existing gh-pages
branch were displayed instantly. Neat to see that explained here.

------
maxmcd
This is all sitting behind a CDN, correct? [1] Might explain why it was able to
survive on two servers for so long.

Or is that automatically assumed when reading about a static hosting setup?

[1] [https://www.fastly.com/customers/github/](https://www.fastly.com/customers/github/)

~~~
manigandham
It's mentioned in the article.

> We also have Fastly sitting in front of GitHub Pages caching all 200
> responses. This helps minimise the availability impact of a total Pages
> router outage. Even in this worst case scenario, cached Pages sites are
> still online and unaffected.

------
mwcampbell
Only tangentially related, but I sometimes wonder if GitHub's management
regret making GitHub Pages available for free, now that it's being used so
heavily for personal and even business blogs, rather than just companion sites
for open-source projects. They could be charging for static websites, as
Amazon S3 does.

~~~
holman
I never heard anyone gripe about it... not even once. The cost is pretty
negligible, and there's a lot of halo benefit (i.e., you just get more people
involved on GitHub the platform itself).

The fact that a lot of non-technical employees in marketing and other fields
are using it for corporate blogs is actually a nice bit of pressure on the
organization to make Pages and web editing even simpler for those users. It
becomes harder to lean on "oh it's a developer site so they'll figure it out".

Mostly, though, I think it's just a matter that we wanted it for ourselves.
It's pretty awesome from an industry bystander's perspective to have something
free, simple, and static, so we can all benefit from more stable docs, blogs,
and so on. Maybe that'll change in the future and something Totally Different
will change the industry, but for right now I think it's pretty rad, and
totally worth the investment.

~~~
sneak
You've just optimized for my happiness. Thank you.

------
jsingleton
Nice! Does HN still run off of a single server and CDN too?

The CDN is key here, which you get if you use a CNAME (or ALIAS) instead of an
A record for your custom domain on GH pages. I've found pairing pages with
CloudFlare works great if you want to use a naked domain and you get HTTPS
too. You can set up a page rule on CF to redirect all HTTP to HTTPS as well.

------
nvk
It's time for GitHub to start offering some basic hosting infrastructure for
small projects, a light Heroku, at least for JavaScript (which kind of already
works).

I'd pay extra for that. I (we all) have a bunch of personal sites, landing
pages, marketing sites and tiny side projects that I'd love to not have to
deal with hosting for. I think they'd make a killing, but I also think it must
be in the works.

~~~
spdionis
I think the leap they'd have to make in infrastructure and architecture to
support that is not worth it in their mind. But who knows.

~~~
chralieboy
Also the mental jump. Just because something is easy for them to do doesn't
mean it is worth the distraction cost.

Github builds tools for developers. Atom, chat (abandoned), Pages, Gists, and
github.com all fit within this. They tie into how teams operate. Serving JS is
tangentially related — certainly something a web developer does — but not
really core to their mission.

------
ngrilly
Great summary of your architecture. Thanks for sharing!

A few questions:

- Is everything in the same datacenter or in different datacenters? What
happens if the datacenter is unavailable for some reason? Is the data
replicated somewhere?

- You moved from 2 machines to at least 10 (at least 2 load balancers, 2
front ends, 1 MySQL master, 1 MySQL slave and 2 pairs of fileservers). That's
a lot more. Do you need more machines because you need more capacity (to serve
the growing traffic), or just because the new architecture is more distributed
and requires more machines by "definition"?

- I understand the standby fileservers are idle most of the time: reads go to
the active fileserver, and only writes are replicated to the standby. Am I
understanding correctly? If yes, it looks like "wasted" capacity?

------
jsingleton
Something I would really like is to be able to set the custom MIME type for an
app cache manifest file. That way you could easily host offline web apps from
GH pages. Anyone know a way to do this without using S3 or similar?

[https://en.wikipedia.org/wiki/Cache_manifest_in_HTML5#Basics](https://en.wikipedia.org/wiki/Cache_manifest_in_HTML5#Basics)

~~~
ryanseys
You shouldn't need to specify a custom one. GitHub Pages will automatically
serve the file with the appropriate mime type given its file extension. Here
[1] is the list.

[1]:
[https://github.com/jekyll/jekyll/blob/master/lib/jekyll/mime...](https://github.com/jekyll/jekyll/blob/master/lib/jekyll/mime.types)

Edit: As you can see in that link, both the .manifest and .appcache file
extensions map to the text/cache-manifest MIME type.

~~~
jsingleton
Nice. Very helpful. I must have been using the wrong (or no) extension before.

------
datums
Have you thought about using BIND as the DB for the routing? An internal DNS
lookup for the storage node: storage.url -> 10.0.12.1

------
BillinghamJ
Seems odd to me that the router hits a MySQL database on every single request
rather than just hashing the hostname as the key for the filesystem node.

~~~
tdicola
Hash-based partitioning has a big problem: when you change the number of
buckets, all of the data moves around. Eventually you'll need to do a
lookup-based partitioning scheme. You also probably want control over where
some users live, since you don't want two super hot users on the same server.

~~~
TheLoneWolfling
There are hash schemes that prevent that, however. Look up consistent hashing.

And you could probably have a (small) hashmap that allows overrides for the
most used pages.

~~~
twic
My thought too. If they have half a million sites, and only 1% are hot enough
to need deliberate placement, then they only need to store five thousand
special cases, which is a few megabytes of memory, easily stored on each load
balancer and loaded at boot.

Use hashing for the rest. Ideally consistent hashing, or rendezvous hashing,
which I just read about on Wikipedia so it must be good:

[http://en.wikipedia.org/wiki/Rendezvous_hashing](http://en.wikipedia.org/wiki/Rendezvous_hashing)
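
Rendezvous hashing is only a few lines; a minimal Python sketch with made-up
server names:

    import hashlib

    def owner(host, servers):
        # Each server gets a score for this host; the highest score wins.
        # Adding or removing a server only remaps the sites it wins/loses.
        def score(server):
            return int(hashlib.md5(
                (server + ":" + host).encode()).hexdigest(), 16)
        return max(servers, key=score)

    print(owner("example.github.io", ["fs1", "fs2", "fs3"]))

The small override table for hot sites would just be a dict checked before
hashing.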

------
linc01n
I thought GitHub Pages was running on Riak and Webmachine as of 2012 [0].

[0] [https://speakerdeck.com/jnewland/github-pages-on-riak-and-
we...](https://speakerdeck.com/jnewland/github-pages-on-riak-and-webmachine)

~~~
jnewland
I threw out that prototype soon after the talk. At the time, there weren't a
lot of other engineers at the company doing Erlang, so maintenance was
considered to be a long-term problem. I'm glad we made that call.

~~~
misterbee
There are so many presentations of the form "We're using
Erlang/Scala/whatever, it's so awesome!", but so few follow-ups when they give
up on the idea for production.

~~~
troutwine
It's hard to sustain some alternate technology in the face of common
knowledge. Rarely does technical advantage outweigh hard-won operational
experience.

------
tracker1
It seems to me they could have gone a step further with something like
Cassandra. With a Cassandra cluster, they could have used a partition key of
the domain name plus the route in question; the resource path (excluding
querystring params) could then be used to look up a single resource in
Cassandra and return it directly.

A preliminary hit against a domain forwarder would be a good idea as well,
but for those CNAME domains, dual-publishing might be a better idea... where
the github name would be a pointer for said redirect.

While Cassandra itself might not be quite as comfortable as, say, MySQL, in
my mind this would have been a much better fit... replacing the fileservers
and the database servers with a Cassandra cluster. Any server would be able
to talk to the cluster and resolve a response with a reduced number of round
trips and requests... though the gossip traffic in Cassandra would probably
offset some of that benefit.
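
Something like this, roughly (keyspace, table and column names are just for
illustration, using the DataStax Python driver):

    from cassandra.cluster import Cluster

    session = Cluster(["10.0.0.1"]).connect("pages")
    session.execute("""
        CREATE TABLE IF NOT EXISTS site_resources (
            host    text,
            path    text,   -- resource path, querystring excluded
            content blob,
            PRIMARY KEY ((host, path))
        )""")

    row = session.execute(
        "SELECT content FROM site_resources WHERE host = %s AND path = %s",
        ("example.github.io", "/index.html")).one()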

~~~
samlambert
Adding a database that is new to GitHub would not be a pragmatic move.

------
Sir_Cmpwn
I remember the time I mistakenly drove huge amounts of traffic to Github
Pages, believing they had the infrastructure to handle it. I apologise for
last year's downtime :)

Glad to hear it's being improved. I'm impressed that it was able to run on
such simple infrastructure for so long.

~~~
cddotdotslash
I doubt you caused any downtime. All the page content is behind a CDN.

~~~
Sir_Cmpwn
I did, confirmed by the angry emails from GitHub and the suspension of the
relevant repository.

------
methyl
I'm wondering why GitHub prefers MySQL over PostgreSQL.

~~~
samlambert
Stable, proven, popular, great roadmap, great replication story, great tooling
and an awesome community.

~~~
brador
Would PostgreSQL be meaningfully faster?

~~~
twic
My heuristic is that MySQL is faster for inserts and single-table reads by
primary key, but PostgreSQL is faster for more complex queries, particularly
if they involve joins or subqueries.

Basically, MySQL is a key-value store in an RDBMS's clothing.

~~~
tracker1
On the flip side, PostgreSQL's high-availability options aren't in the box,
or are at the very least varied, problematic and/or cost more than other
options for support contracts.

Every time I've worked with MySQL I've seen some irksome behavior... just the
same, setting up failover options is miles ahead of PostgreSQL. And when you
already have in-house talent, it becomes even more obvious.

My only thought was that using a clustered database (such as Cassandra) as
the store for the data itself might have been better. A domain/URL key (minus
querystring) would hash and distribute fairly well, and even a relatively
small cluster with 2 replica nodes per shard would be pretty effective. Also,
it would be easier to manage a replicated database, in my mind, than to track
sites to pairs of static servers. GoDaddy is/was moving to something similar
with new development on one of their applications when I worked there, and
was able to serve a huge number of static requests (hundreds of thousands per
second) off of relatively few servers with sub-10ms response times, for
content not backed by a CDN.

In the end it just goes to show that serving static content on modern
hardware can scale _really_ well, with a number of options for technology.
Which is why I'm somewhat surprised that something hasn't stemmed the tide of
poorly configured WordPress blogs.

------
baghali
Question for the GitHubbers:

Have you considered using clustered file systems such as GlusterFS or Ceph?

~~~
kyledrake
I looked into GlusterFS at one point. GlusterFS is a no-go for static file
serving in hostile environments. It asks every node to look for a file, even
if it's not there. You can imagine the DDoS attacks you could build here using
a bunch of 404 requests for files that don't exist.

One story I heard from a PHP dev is that it would take 30 seconds to load a
page while it looked for all the files needed to run it.

~~~
jldugger
Yeah, GlusterFS is terrible at PHP, or anything involving lots of small files.
Like static sites, or, say, git repos.

------
el33th4xx0r
Surprisingly, they use MySQL (instead of a currently hyped k/v store) to map
hostnames to fileservers.

~~~
charliesome
MySQL is actually a really good key value store!

Here's the schema we use for the routing information:

    CREATE TABLE `pages_routes` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      `user_id` int(11) NOT NULL,
      `host` varchar(255) NOT NULL,
      PRIMARY KEY (`id`),
      UNIQUE KEY `index_pages_routes_on_user_id` (`user_id`)
    );

Since we use MySQL for everything else, we decided it made the most sense to
keep this routing data here rather than introducing a new database.
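
The lookup at request time is then a single query per host; a rough sketch in
Python with PyMySQL (connection details are made up, and the mapping from
user_id to the active fileserver pair is assumed to live elsewhere, not in
this table):

    import pymysql

    def route(conn, host):
        # Resolve the request's Host header to the owning user; resolving
        # that user to a fileserver pair is assumed to happen elsewhere.
        with conn.cursor() as cur:
            cur.execute("SELECT user_id FROM pages_routes WHERE host = %s",
                        (host,))
            row = cur.fetchone()
        return row[0] if row else None

    conn = pymysql.connect(host="127.0.0.1", user="pages", db="github")
    print(route(conn, "example.github.io"))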

~~~
toomuchtodo
> Since we use MySQL for everything else, we decided it made the most sense to
> keep this routing data here rather than introducing a new database.

Very smart. "Perfection is Achieved Not When There Is Nothing More to Add, But
When There Is Nothing Left to Take Away"

------
joshrotenberg
I remember seeing a talk about a Github Pages rearchitecture at Erlang Factory
in SF in 2012: [http://www.erlang-
factory.com/conference/SFBay2012/speakers/...](http://www.erlang-
factory.com/conference/SFBay2012/speakers/JesseNewland)

I don't see anything about those components in this post. Did that
architecture never make it to production?

~~~
krallja
answered elsewhere -
[https://news.ycombinator.com/item?id=9612914](https://news.ycombinator.com/item?id=9612914)

------
serverholic
Can you explain a bit more about the load balancer tier? The blog post doesn't
say much about it and I'm curious about your haproxy config.

~~~
jssjr
This sounds like a great idea for a future blog post. Thanks for the
suggestion!

------
gabrtv
I was hoping to see something about optional SSL/TLS. I'm even willing to pay
for it.

~~~
jsingleton
You can use CloudFlare with flexible SSL in front of it and set up a page rule
to redirect to HTTPS.

~~~
gabrtv
I had no idea. That's great news!

~~~
drewbug
Be warned, though: this will only create a secure connection between the user
and CloudFlare, not between CloudFlare and GitHub Pages.

~~~
tracker1
While a potential concern... I would guess that CloudFlare's own routing
would take them to an exit node close to GitHub's... that does make quite a
few assumptions, though. It would still be more complicated to MITM between
CloudFlare and GitHub than between ISP-X (or China) and GitHub...

------
oblio
Somewhat related, what's the SLA for GitHub Pages? Let's say someone wants to
move a highly successful blog there: are there some quotas? Maybe I missed
them, but I don't remember seeing any...

~~~
benbalter
Ben from GitHub here. We don't offer an SLA for GitHub Pages (or GitHub.com
generally).

We've talked a bit about our typical traffic in the past
([https://github.com/blog/1992-eight-lessons-learned-
hacking-o...](https://github.com/blog/1992-eight-lessons-learned-hacking-on-
github-pages-for-six-months)), but suffice to say, we host some highly
trafficked sites like the Bootstrap documentation.

If your blog isn't on that scale, and conforms to our terms of service
([https://help.github.com/articles/github-terms-of-
service/#g-...](https://help.github.com/articles/github-terms-of-
service/#g-general-conditions)), we generally don't enforce hard usage quotas,
but if you're concerned, feel free to reach out to support@github.com any
time.

~~~
icefox
I know users are having a lot of success using Travis CI to automatically
build when there are new commits to master and push the results back to the
gh-pages branch.

GitHub Pages is so cheap and easy that it's the disruptive technology eating
the lunch of standalone hosting. Users don't think about servers, deployment
or other things; they simply push a branch and poof, it's on the web.

Do you have a wishlist of features to add to GitHub pages? Maybe allowing
minimal sandboxed server side computation with a max runtime of say 1ms,
setting headers, redirects or other stuff? I am guessing every little addition
would eat away at the alternatives.

~~~
benbalter
> Do you have a wishlist of features to add to GitHub pages? Maybe allowing
> minimal sandboxed server side computation with a max runtime of say 1ms,
> setting headers, redirects or other stuff? I am guessing every little
> addition would eat away at the alternatives.

Moving from WordPress, the fact that you couldn't constantly tweak a thousand
little things was extremely liberating. That's the zen-like simplicity of
GitHub Pages that, to me, makes it an attractive option over heavyweight
alternatives. Just push and your site is live. Fewer things to break and fewer
things to worry about means more time to focus on what matters: your content.

------
nemothekid
I never knew about this way to extend nginx. I've been looking for a way to
make nginx "smarter" in our Mesos config, as DNS-based configuration had some
holes.

~~~
lamby
Curiously, I remember sticking with lighttpd on a project for quite a while
simply for its Lua integration (nginx didn't have it at the time).

------
wcdolphin
I love this write up. Awesome example of how to build the minimal, simple
thing first and expand as traction and needs develop.

------
k_bx
I find that for such tasks, instead of trying to hack around the nginx config
and its Lua scripting, it makes sense to throw it away and write a small app
in, say, Haskell+Warp that does the job. It's as fast as nginx (and probably
faster than nginx+Lua), has many more static guarantees, and expresses the
logic more clearly.

------
siliconc0w
If you use the magic of hashing, you can look up where users' pages are
without the dependency on the database.

~~~
charliesome
Unfortunately this isn't an option for us. We need to be able to move sites
between fileservers periodically to keep disk usage and load balanced. Adding
new fileservers when using a hash modulus based routing scheme is also quite
complex as it would require copying quite a lot of site data between
fileservers.

Pages also supports custom domains for both user sites and per-project sites,
so we'd still need a way to resolve domains to users.
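
To make the modulus problem concrete, a quick Python illustration with
made-up hostnames; going from 4 to 5 fileservers under naive hash-mod routing
remaps roughly 80% of sites:

    import hashlib

    def bucket(host, n):
        # Naive hash-modulus placement across n fileservers.
        return int(hashlib.md5(host.encode()).hexdigest(), 16) % n

    hosts = ["site%d.github.io" % i for i in range(10000)]
    moved = sum(bucket(h, 4) != bucket(h, 5) for h in hosts)
    print("%.0f%% of sites change fileserver" % (100.0 * moved / len(hosts)))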

~~~
emj
Think about moving around hash buckets, not sites or files. You can decide
what you want to keep in memory: data or metadata. Right now you need both
site and file metadata; neither of those needs to be stored when serving
static files that you yourself rename and place every time they are published.

------
spdionis
Great example of YAGNI, related to the post by Martin Fowler posted earlier
today on HN.

------
nosideeffects
Why did they choose the 98th percentile, specifically? Why not 99th and
friends?

------
rattray
I hadn't heard of Fastly before. How do they compare to Cloudflare?

~~~
nullrouted
Fastly is actually really great, but they cost $50/month to start. Their
throughput and routing are amazing, though.

~~~
pbowyer
I hear many great things about Fastly from users, but whenever I look at CDN
performance comparisons (e.g. [https://cloudharmony.com/reports/editions/cdn-
performance-re...](https://cloudharmony.com/reports/editions/cdn-performance-
report/basic/cdn-performance-report-0914.pdf)) they never do brilliantly for
US/EU regions. What gives?

------
outworlder
Even if Lua is little more than a footnote here, it's nice to see that, yet
again, things that rely on it work as advertised.

------
sneak
Guess I'm learning Lua now. :D

------
coolrhymes
Awesome write-up. I thought I was the only crazy guy using Lua to route
subdomains to internal servers. A little surprised by the MySQL part instead
of a Redis-like fast lookup key.

------
wiradikusuma
Off-topic, but since we're talking about GitHub Pages... I own github.id, and
I'm thinking of making a "LinkedIn for GitHub users". Do you guys think
there's a market for that?

~~~
tsm
Wouldn't there be huge copyright issues with that?

~~~
krallja
No. "GitHub" is a trademark, not a copyrightable work.

