Hacker News new | past | comments | ask | show | jobs | submit login
Bitly: Lessons Building a Distributed System that Handles 6B Clicks a Month (highscalability.com)
175 points by aespinoza on July 14, 2014 | hide | past | web | favorite | 64 comments

"4 boxes with 1x RAM will always be cheaper than 1 box with 4X RAM."

Is that really true? RAM is dirt cheap compared to everything else that goes into a server.

And then later under "Lessons Learned" we have: "Put it all on one box if you can. If you don’t need a distributed system then don’t build one. They are complicated and expensive to build."

This varies by the amount of RAM, if the total RAM being talked about is <= 256 GB (or in some cases 128) this statement will usually be false. But for RAM totals above 256 GB (and in some cases 512) it is almost always true. But these watermarks will change over time so this assumption must but re-evaluated from time to time.

$WORK anecdote:

We had a quartet of legacy-ish databases with sharded customer data (had some interconnection, but we sharded in the application-layer, so the database didn't have to move any data internally). It was a bitch to maintain, but it performed OK, by throwing fairly beefy machines at it.

We began a quest to replace the database with MongoDB/HDFS/Riak/CouchBase/Cassandra/whatever. We wrote a script that converted our data to JSON for easier ingestion and began hammering away. Until we discovered that by converting it to JSON (and simplifying some parts of the data-model), our dataset had actually become small enough to fit on a single server in memory (still ~100GB).

So we ditched the cluster-thingie and moved everything to CouchDB. Pricy hardware, but the saved man-hours recouped that within a month or two.

(And in half a year's time, the database will have outgrown whatever hardware we can reasonably throw at it, so then we'll have to start over for real...)

That's odd. When I last checked, CouchDB took a lot more memory. Unless you are doing something very odd with how you model your data ("select (key, value) from attributes where recored_id = ?" ... why are you even using SQL then?).

I expect that slimming down the data model was the main factor, or you simply didn't want to use SQL at all (or your data wasn't really structured enough).

But why throw it in couchdb when you could use a rdbms ?

We don't need a RDBMS; we get along quite well with key/value + keyscans, making it a lot easier to switch to a database with roughly similar features later (CouchBase, Cassandra, HBase, possibly others)

OK, slightly ignorant question here, but SQL is kind of a standard for RDBMS's. As far as I am aware no similar starts exists in the NoSQL world, so I would have assumed that it would make switching harder. If you are just using it for basic stuff, then "SELECT * FROM table" is pretty standard across RDBMS's.

Reading this: (And in half a year's time, the database will have outgrown whatever hardware we can reasonably throw at it, so then we'll have to start over for real...)

You need a rdbms. Field names won't be repeated for each query and you will push later the time to 'start over'.

It must be web scale.

I haven't run the numbers, but I would imagine that 4 boxes with 256 GB of RAM are significantly cheaper than 1 box with 1TB of RAM, especially if the rest of the box isn't top-end equipment.

The density of the DIMMs is where the price difference jumps. Right now 32GB sticks push the price way up. Common current chassis have 24 slots letting you hit 384GB total with no big surprises on 16GB sticks.

Getting to 1TB would certainly require 32GB sticks and would be the point where multiple machines gets cheaper.

At that scale... you may be right. Jumping above 256 GB of RAM on a server is still pretty expensive.

If you're building your own boxes, sure the price of physical sticks of RAM aren't the bulk of the cost, and you can build high-RAM boxes.

If you're renting servers or virtual servers (VPS/cloud), then the physical cost of RAM is irrelevant. Virtually all hosting companies use configured RAM as a proxy for their costs (the more RAM you need, the more likely you are to be using more CPU which means more electric, more disk IO, more bandwidth, and generally putting more wear-and-tear on the hardware which will need replacing sooner).

An 8GB stick of ECC RAM might cost you less than $99 to purchase outright, but Softlayer (the #1 most used bare metal server company by YC startups) will charge $88/month for it ($11/GB/mo to add 8GB to one of their base configurations).

> An 8GB stick of ECC RAM might cost you less than $99 to purchase outright, but Softlayer (the #1 most used bare metal server company by YC startups) will charge $88/month for it ($11/GB/mo to add 8GB to one of their base configurations).

I never understood this. I started and sold a hosting company in the early 2000's. I've run managed hosting divisions for Fortune 500 companies. Hosting is STUPID profitable (>30% margins). Why doesn't Y Combinator grab 10-20 racks in a cheap colo center, buy $100K-$200K worth of equipment, and provision that to YC startups similar to Rackspace's OnMetal offering?

I do DevOps now, but colo is still not as hard as everyone makes it out to be.

Who'd maintain & administer it? As I understand it, YC is not in the business of providing business services to its portfolio companies. They setup the network and information, but they don't really have headcount or hiring expertise to bring functional roles in-house and then rent them out to portfolio companies for cheap.

It's an interesting question whether there's a market for an investment firm that does do this. One of the things that appealed to me about Google under Eric Schmidt was that it basically functioned as a venture capital firm that had shared technology infrastructure, server maintenance, legal, PR, etc. but let you choose what to work on and come up with your own product ideas, with the one caveat that Google would own any business you built. It'd be very interesting to see a VC firm that did the same - cover all the functional roles so that entrepreneurs could focus on pure product development. I hear a16z is sorta shooting for that, but they cover more of the human-capital stuff like recruitment, executive search, etc. rather than the technical stuff.

There isn't an obvious line from "save 30% on your hosting bill" to "grow by 8% this week", which suggests that it isn't terribly high on YC's list of priorities. I can think of a few YC companies with notable infrastructure requirements but for many of them it's below the noise floor.

I'd be interested in the yearly/3-year cost data of infrastructure based on a VC's total tech portfolio. I can't believe it to be insignificant. If you're a tech startup, your two biggest costs are labor and infrastructure/hardware.

The fully-loaded cost of a full-time Rails programmer is, let's say, $15k per month.

An Amazon m3.large with 7.5GB of RAM and a 32GB SSD costs $64 per month if I reserve a year in advance. (And why wouldn't I, given that the proposed alternative is to buy hardware?)

If I wave the magic wand and save 30% on the cost of each m3.large, I save $19 per month per instance.

If my company runs the equivalent of twenty m3.large instances per programmer, 24 hours a day, for a whole year, my magic-wand-powered savings will add up to six additional programmer workdays - about $4600 per programmer. If I start out on January 1 with a year of runway, I'll go broke on January 9 of next year instead of January 1.

So, even if I were to believe in the existence of magic wands, they wouldn't make much of a difference.

You're raining on my thought experiment :)

If you're a tech startup, your two biggest costs are labor and infrastructure/hardware.

I am aware of many companies where advertising, conferences (throwing their own), laptops/mice/etc, rent, general office overhead, credit card processing, and about five other things I'm forgetting are still higher than infrastructure/hardware, in addition to direct costs of employment. (Obviously, less true if you run Dropbox, but COGS is looooooooooow in most SaaS companies for a reason.)

Because Amazon (AWS), Heroku and Rackspace already give every YC company (and a lot of other startups) free hosting worth more than $100,000.

RAM is cheap, and the rest of the box costs more than the RAM, so I'd say no.

Let's say a 1GB module costs $25 and a 4GB one is $40. (Just picking some random Corsair ones of compatible specs and disregarding things like using matched pairs of lesser capacity instead of single modules. For simplicity.)

Two of those 1GB modules would already cost more than one 4GB one. And that's to say nothing of the rest of the hardware, which you can entirely avoid if you have a single system instead of four.

The only place it makes sense is VMs through a cloud provider, where memory is often one of the big pricing metrics. It's possible there could be savings through some of those providers.

That's true for this scale, and probably up to 32GB or so. But bitly is running at much larger scale and he is talking about hundreds of GBs, which is a totally different game.

I don't think that they are putting 1GB or 4GB modules in their boxes.

6 Billion Clicks is not the hard part, Storing analytics for those clicks is. Though this is a pittance compared to what Google, or Adobe serve with their analytics.

I don't know enough about how "realtime" Bitly is. Handling 6 Billion writes a month of "raw" data from a user as a serialized string would not be that hard.

2300 clicks and there for analytic writes per second average for the month is likely based on the 80/20 rule about 10k Clicks per second peak.

Now, if we assume that you write a Serialized Write immediately with all the data from a users, and then do a chasing analysis, so that you don't have to do all the work at Peak Price... and that each user is then 20 writes.

We end up with 10k peak Serial, and 5x that in chasing writes. So we need 60k writes per second.

On DynamoDB that would cost $90k upfront and $7.71 an hour. ($99,500 a month)

That is "a lot" but it isn't huge.

Doing this on Google AppEngine would likely be about the same since you pay fixed fee per write, not based on your throughput. Depending on the amount of indexing you would pay $1.80 - $2.40 per 100k clicks based on the above math so $108k - $144K per month.

I am not familiar enough with Azure to quote a price.

I know there would be other costs. This is just the database portion. But as I expect this to be the majority of the price, I thought it was the part most worth discussing if you were building a Bitly on a Cloud Platform.

I use to run an ad server where numbers of requests like this per month would be considered low(we would regularly do hundreds of millions of requests per day peaking at around a billion per day). We had the added fun of requiring extremely low latency and we couldn't toss 400 servers at it either.

With chasing writes you do it in slow periods between traffic bursts since you're basically just pulling them off a queue to process so you don't need to count that in with your peak burst numbers.

Your costs seem really high too. The above system was ~$10K/month on GoGrid including 2 DB servers that were on dedicated servers(not really impressive ones either I think they were ~$500/month each), a load balancer on a dedicated server, 2 dozen webservers or so, a few support servers(admin panels, client interfaces, puppet, etc), and a small hadoop cluster.

Redis would receive the raw data, the DB stored the rolled up data and the raw data/logs would be compressed and go onto a small hadoop cluster in case we needed to process it for a new type or report or look for something specific.

I agree, I was trying to keep the math simple, and give some play for bigger peaks.

An ad network is another great example of where this would be tiny numbers.

And with Ads more than Analytics you have to be aware of Race Conditions, and have to do more management of reads and writes so that you don't over or under serve a campaign.

FWIW, this is about 2,300 qps on average. Nothing to sneeze at, but the per-second scale removes the awe factor of "6 billion a month".

If you want to be a stickler on significant figures, it's actually 2,000 qps.

And 30 servers handling the frontend! Even if they peak at 10,000/sec, that's only a few hundred a server/sec. And another 370 servers to do other stuff.

Right now, we peak at 11 servers for ~550-600 rps - those are AWS c3.medium servers. We're moving from Python to Go to try to squeeze more out of each server.

But our bottleneck is MySQL, and are moving to Riak. Our DB is the only part of our stack that isn't inherently horizontally scalable - which seems to be the case for a lot of services that are hitting that 500 rps rate (maybe 750 qps or so).

Does it bother anyone else that services like bitly add a single point of failure to URLs? Also, it's kind of amusing that they've spent so much engineering effort to distribute a protocol that's already massively distributed by default.

The important bit for most people:

> Put it all on one box if you can. If you don’t need a distributed system then don’t build one. They are complicated and expensive to build.

For most businesses that aren't about massive scaling, your time is probably better spent on marketing, new features and so on.

Interesting for me: they are, or were, users of Tcl.

> For most businesses that aren't about massive scaling,

Most business should care about distributed systems because they (often) need some form of high availability. Sure, you're not scaling to Google or New York Times levels here, but if you want to survive any of the random crap that can happen to you, you can't have a single host per role.

Yes, sure, you may not be worried about concurrent writes to multiple replicas, but what about reads from multiple replicas, or a read to a slave after a write to a master (lets say master just died in a horrible manner, and slave is now being hit with reads and writes)?

PS: TCL is awesome.

> (often) need some form of high availability

I'd guess that, in reality, many don't really need high availability. I mean, it's not going to hurt if they have it, but if it comes at a cost of not doing other things with their limited resources, it may not be the best thing to be doing.

BTW, it's Tcl, not TCL!

Gah, I should know better about the name. Let's just pretend I was yelling it because I was excited? k?

Anyway, it's a gradient. Sure, most places don't need 99.9999% availability, but most places also can't afford to be down for a few hours to a day, especially during their peak season.

So, do you need automatic failover and promotion? Probably not. Do you need contingency plans (with may/should include accounts with multiple vendors (even if something like a LiNode account with no boxes)) and have practiced brining everything up (or if you're e.g. a small retail shop, how to checkout), yes, most defiantly.

Even if your distributed system isn't "moving" you still need to plan for things. Like I said, if your database server becomes unavailable, you need to know how much data hasn't made it to the spare, otherwise you need to ask "How much will I loose if I restore from my last backup". Things of that nature need to be planned around and known, even if the code doesn't need t care.

6B a month =~ 2k per second. Just putting the numbers in perspective :-)

That's assuming a linear distribution of clicks and we all know it's never the case. It could well be over 10K sustained for peak hours. Averages are often misleading :)

Without wanting to detract from the engineering accomplishments here (which I have never come close to), it's important to note that low-latency does not appear to be a design criteria i.e. it's okay if it takes a couple minutes to process events during peak load, which means there is some leeway for smoothing the input peaks over processing time.

Furthermore, these are just click events. It's okay to lose a few, so the design doesn't have to be especially good at making sure events aren't lost, as far as I can tell.

A good design is as much about what is left out as what is left in, which is the lesson between the lines here in my opinion.

My thoughts exactly. If you come to my house and ring the bell, but it takes me 15mins to get off the couch to reach the door, I can't really claim to be "available" with a straight face. Sure, I "technically" am available, but that level of latency is not practical.

The system is still available if your ring gets an acknowledgement of receipt. The latency for the request to be served is a design metric and has no impact on "availability"

>> The system is still available if your ring gets an acknowledgement of receipt

This is the equivalent of (as per my example) yelling out "I hear ya! Coming..." and then take 15mins to reach the door. I never said that low latency implies anything about the level of availability; I merely meant that arguing about the availability of systems is incomplete without a thorough discussion of latency.

In the case of Bitly, I'm just curoius about the systems that are highly available but "require" low latency vs systems that don't require it. As ryanjshaw points out, the system may have a degree of tolerance for lossy click events. If you have a heterogeneous mix of systems with different tolerance levels, that surely affects the architecture does it not?

They say that the clicks and shortened URLs are handled synchronously. It is the recording of the clicks for analysis that is done asynchronously.

Annoyingly, for something designed to scale as much as Bitly does, it has some very odd wrinkles.

Whenever I visit my stats page at https://bitly.com/a/stats it shows me the stats from the last time I visit, and I have to do a hard-refresh with Ctrl-F5 to get the latest version.

Or possibly this is part of how you scale - only retrieving the latest versions of stats when you need them. And not with any accuracy (clicking on a given day pretty much never gives me numbers of clicks that add up to the total for that day).

While reading people talking about writing data to permanent store, I got an interesting (well at least for me!) question.

My initial idea was that it would be possible to store data to be written at RAM at first, and periodically flush to hard drive / DB. But then on other hand, OS does that already by using cache and flushing it, just at much lower level. Some DBs are probably doing it too.

So the question is, is it worth implementing strategies like that (home made cache), or it is a better idea to trust OS/DB by default?

Trust the OS, and don't use 70's technique like separating RAM and disk. Here's a really interesting piece on how PHK built varnish by using "modern" software techniques:


And here's a Key-Value db you can use in your own programs that reuses the same idea, ie ask the OS for some space, trust it for RAM/Disk allocation; it turns out it is excellent at what it does thanks to careful design:


> My initial idea was that it would be possible to store data to be written at RAM at first, and periodically flush to hard drive / DB.

Sure, but you lose anything that isn't written to disk. IIRC you can set up Mongo and Redis to do this (I'm not familiar with many of the other options). At this point you're no longer an ACID database.

Most DBs will cache as much as they possibly can in memory. Often index's will take priority, but rows that are being accessed fairly frequently will also be in their cache. Obviously the details are different per db and configuration, but in general, it gets cached.

Also, when you fwrite your data actually sits in RAM for a bit before being written to disk. Databases and other applications that actually require data to be on the disk right now need to call fsync.

(Aside, even then you're disk can and will store writes in it's own internal buffer before it can be written to disk. The OS doesn't have much control over this.)

A lot of people will, to avoid having to spin up db replicas, cache rows in redis or memcache and use those for reads whenever possible. It has its advantages and disadvantages, but for pk-based reads you can decrease the number of operations your DB needs to do at once (allowing it, for example, to focus on writes.

I can speak from experience on this, home made cache can be much better depending on your avg object sizes. Letting the OS make decisions usually in databases is the wrong choice. The data is cached at page level (4k), if you have many objects randomly spread across your working set it's not going to be as efficient as object caching. Usually the more control you desire the better for the performance of your database. One technique that is used if you want to just let the OS deal with cache/paging of data is you have a dedicated commit log that is written with fsync on every update or periodically depending on your durability requirements.

The OS usually does a pretty good job, but remember it's designed to be generic not for a specific use case. When you really care about performance of a database you tend to bypass what the OS does, and take control bypassing things like the scheduler, cache, etc.

SAP HANA (and then to OCFS or GPFS) does this: http://en.wikipedia.org/wiki/SAP_HANA

And AFAIK, InnoDB (then to MySQL) does as well.

This article is about horizontal scaling and distributed systems.

Some commenters here are missing the point.

"Or, you could just avoid the OS altogether" [and get 40 million requests per second]:


6B requests/mo = ~2200/sec 40M requests/sec = 105 trillion/mo (or the number of red cells in the human body times 5)

Packet handling != application logic

But yes, there is some interesting work being done on building network applications directly against the hardware.

2300 a second sustained. I was doing that on PA RISC under HPUX in 1995. BFD.

1. Don't be a dick.

2. It's obviously more than 2.3k sustained; there will almost certainly be variance, maybe up to about 10k per second.

3. It does seem rather unlikely that you were serving 2300 clicks per second in 1995, given the minuscule scope of the web at that time. That said, if you were, I'm sure many of us would be interested in hearing about it. It would probably be more productive than bitching.

The Shanghai Stock Exchange trading system. In memory, simple text based messaging over TCP protocols. The main reason for my distain is the trivially shardable nature of URL shortening. Analytics is non real time so it does not count. Down vote away.

So in other words, you did something totally different.

I helped admin a site in 2001 that did tens of thousands of hps to dynamic content on a mod_perl app. And Bitly's doing static content (301's) of what, a 500 byte payload? Our static site layer did hundreds of thousands of hps with minimum a half meg of payload.

The guy's got a valid point. Bitly's using way overcomplicated tech and employs way more engineers than you need to host a trivial amount of traffic for a large scale site. This is a textbook case of over-engineering.

It wouldn't surprise me if that was a standing order from the top - make sure it can scale to everyone on the planet. I built a service not too long ago where they were expecting 50M users in the first year. Real numbers? It's hovering around 300k users. But the good news is it can handle those 50M if they ever come :-) It is highly over-engineered.

Depends on the flexibility you want and your product's strategy. Bitly could do the same thing with objects in S3, served by Cloudfront, and get their analytics async using Cloudfront's logs. 2 engineers minimum, 3 for good measure. But that doesn't leave you much room to go anywhere else with your business.

"Simple text based messaging over TCP" doesn't sound too different from serving http.

Stop being so negative, brah.

Hipster-irrelevant engineering context is generally underappreciated around here (unless it's that one story about "Mel"). Especially when presented with condescension, swift downvotes result.

Hipster-irreverent, too.

Let me worship your awesomeness.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact