Scaling Pinterest - From 0 to 10s of Billions of Page Views a Month in Two Years (highscalability.com)
261 points by aespinoza on April 15, 2013 | hide | past | favorite | 45 comments

I'm fairly amazed by some of the choices - in particular the single huge user table that is hit on every profile page view to map a URL to a shard. I'm guessing it has hot backups and the like - but wow - talk about a single point of failure. I guess having it like that actually makes it easier. If it fails, it should be trivial to get it back up quickly on a backup machine.

The whole setup seems a lot simpler than nearly every 'web scale' solution I've heard discussed.

EDIT: I'm not saying that is a bad thing - it's a good thing. Most of the articles about scaling on HN deal with very complex, Cassandra-style technical solutions. This article is the opposite, with a very 'just do it the simplest way' feel to it. One big MySQL table for all users (probably with some memcache in front) is the sort of thing you can put together in an hour or two - yet it is the very core of their system.

I (and I bet a lot of other engineers reading HN) try to focus too much on perfect implementations - if someone asked me to design the shard look-up for something as heavily hit as this, I'd probably reach for a distributed in-memory DB with broadcasted updates - and I'd probably be wrong to do so. The mysql/memcache approach is just going to be simpler to keep running.
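The mysql/memcache approach described above is essentially a cache-aside lookup. Here is a minimal sketch: plain dicts stand in for memcached and the big user table, and the names (`get_user_id`, `USER_TABLE`) are illustrative, not Pinterest's.

```python
# Cache-aside lookup sketch: one big user table with a cache in front.
# The cache and db here are plain dicts standing in for memcached and
# MySQL; names are illustrative, not Pinterest's actual schema.

USER_TABLE = {"alice": 42, "bob": 7}   # username -> user ID (the "big" table)
cache = {}                             # stand-in for memcached

def get_user_id(username):
    """Resolve a profile URL's username to a user ID, cache-aside style."""
    user_id = cache.get(username)
    if user_id is None:
        user_id = USER_TABLE.get(username)   # single authoritative table
        if user_id is not None:
            cache[username] = user_id        # populate cache on miss
    return user_id

print(get_user_id("alice"))  # first call misses the cache, reads the table
print(get_user_id("alice"))  # second call is served from the cache
```

The appeal is operational: if the cache dies you fall back to the table, and if the table dies you restore a backup - no consensus protocol or broadcasted invalidation to debug.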

You don't really want to get into the alternatives. It's pretty common to shard everything but the core user data (email, PBKDF2 hash, username) and scale that with slaves and vertical upgrades. SSDs can take you a long way if you're fastidious about offloading everything else from the machine responsible for keeping user accounts consistent.

The alternatives aren't nice.

I agree. Usually it is better to scale vertically on the user data, even if it seems painful. The alternatives, like partitioning, are not that easy either; there is pain involved as well.

I think that, of the existing solutions, no one is better than another. The choice in the end is which one fits best with your team and infrastructure, and especially which disadvantages you are willing to put up with.

One table for users is a no-brainer. The outer limit on table size is ~7 billion rows.

And that's just not a very big table these days.

That’s not what they’re doing. They are sharding horizontally on the user. IDs are 64 bits, with the DB location (logical shard ID) in the first 16 bits. This lets anyone with an ID go to the right DB for querying, provided they have the logical-to-physical shard mappings.

From the article:

User table is unsharded. They just use a big database and have a uniqueness constraint on user name and email. Inserting a User will fail if it isn’t unique. Then they do a lot of writes in their sharded database.

Get the user name from the URL. Go to the single huge database to find the user ID.

I believe the parent is correct.

I see. We tackle the same issue by having a redis keyspace with emails. It can get pretty huge without any problems.
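A minimal sketch of that key-per-email idea, with a dict standing in for the Redis keyspace. A real client would lean on SETNX semantics (e.g. `r.set(key, value, nx=True)` in redis-py) for atomicity; `claim_email` and the key layout are hypothetical names, not the commenter's actual scheme.

```python
# Sketch of enforcing email uniqueness with a key-per-email scheme.
# A dict stands in for Redis here; with a real Redis client this check
# would be a single atomic SETNX (r.set(f"email:{email}", uid, nx=True)).

emails = {}  # stand-in for the Redis keyspace, one key per email

def claim_email(email, user_id):
    """Claim an email for a user; returns False if it is already taken."""
    if email in emails:          # in Redis, SETNX does this check-and-set atomically
        return False
    emails[email] = user_id
    return True

print(claim_email("a@example.com", 1))  # → True, email is now claimed
print(claim_email("a@example.com", 2))  # → False, uniqueness enforced
```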

This might be a silly thing to ask, but why don't they save their data in flat files on a shared filesystem?

They wouldn't need Memcached, since the OS caches files anyway; replication with rsync becomes easy; they don't need transactions anyway; and they wouldn't have so much software to manage.

Storing, tracking and managing billions of tiny files directly on a file system is a nightmare.

What about it is a nightmare?

Isn't storing, tracking, and managing billions of entries done directly in a database anyway?

It's a real pain when you want to inspect the files, or delete or copy them.

Try taking 300,000 files and copying them somewhere. Then copy one file the size of the 300,000 combined. The single file is MUCH faster (it's also why we usually tar things up before copying, if they're already compressed). Any database that's not a toy will usually lay the 300,000 records out in a single file (depending on settings, sizes and filesystem limits).

The 300,000 files end up sitting all over the drive, and disk seeks kill you at run-time. This may not be true for an SSD, but I don't have any evidence to suggest it either way.

Even if the physical storage is fine with this I suspect you may run into filesystem issues when you lay out millions if not hundreds of millions of files over a directory and then hit it hard.

I have played with 1,000,000 files before when playing with crawling/indexing things, and it becomes a real management pain. It may seem cleaner to lay each record out as a single file, but in the long run, if you hit a large size, the benefits aren't worth it.
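The tar-before-copy trick mentioned above can be sketched with Python's tarfile module; the handful of throwaway temp files below stand in for the "300,000".

```python
# Sketch of the tar-before-copy trick: bundle many small files into one
# archive so the subsequent copy is a single sequential transfer rather
# than many seek-heavy small operations.
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()

# Create a handful of small files standing in for the "300,000".
for i in range(5):
    with open(os.path.join(workdir, f"file{i}.txt"), "w") as f:
        f.write(f"payload {i}\n")

# Bundle everything into a single archive.
archive_path = os.path.join(workdir, "bundle.tar")
with tarfile.open(archive_path, "w") as tar:
    for name in sorted(os.listdir(workdir)):
        if name != "bundle.tar":         # don't archive the archive itself
            tar.add(os.path.join(workdir, name), arcname=name)

# One file now holds all five; copying it is one sequential read/write.
with tarfile.open(archive_path) as tar:
    print(len(tar.getnames()))  # → 5
```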

In addition,

It doesn't have good querying utilities. You'd have to build your own indexer and query engine. Since you can't put billions of files in a single directory, you'd have to split them into a directory tree. That alone requires some basic indexing functionality and rebalancing tools (in case a single directory grows too large or too small). This is without any more sophisticated querying capabilities like “where X=val” where X isn't the object ID/filename.

Write performance is going to be very poor.

Existing tools for managing, backing up, restoring, performance optimisation and monitoring aren't suited to handling huge numbers of small files, nor to operating on a subset of them (given certain criteria related to the data itself).

You could build specialized tools to resolve all of these issues, but in the end, you'd end up with some kind of database after hundreds of man-years anyway.
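The directory-tree split mentioned above is usually done by hashing the object ID and fanning out into nested subdirectories, so no single directory holds billions of entries. A minimal sketch, where `shard_path` and the two-level layout are illustrative assumptions:

```python
# Sketch of hash-based directory sharding: map each object ID to a
# two-level subdirectory derived from its hash, keeping any single
# directory's entry count manageable. The layout is illustrative.
import hashlib

def shard_path(object_id, root="/data"):
    """Map an object ID to a two-level hashed directory path."""
    digest = hashlib.md5(object_id.encode()).hexdigest()
    # e.g. /data/ab/cd/pin123 - 256*256 = 65,536 leaf directories
    return f"{root}/{digest[:2]}/{digest[2:4]}/{object_id}"

print(shard_path("pin123"))
```

Even this simple scheme already implies the indexing and rebalancing machinery the parent comment describes: changing the fan-out later means rehashing and moving every file.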

Since they're MySQL pros, they could put that on a MySQL Cluster and get high availability (up to 30 or so machines in the cluster, use a larger replica count, on top of drive mirroring, etc.) Also, it's a read-heavy workload, so hide it behind memcached.

I really like how when you look at it, most of their scaling was removing things from the environment and really honing the platform down to what it needed to be.

> Architecture is doing the right thing when growth can be handled by adding more of the same stuff. You want to be able to scale by throwing money at a problem, which means throwing more boxes at a problem as you need them. If your architecture can do that, then you’re golden

My understanding is that the term of art is horizontal scalability.

How does SOA improve scaling? I've never had to build anything as big as pinterest and it seems interesting to me.

I don't quite understand how it reduces connections. If you have your front-end tier connecting to your service tier, you'll still have the same number of requests going to the database, just with a little middle-man... if I understand correctly?

I'm no expert, but hopefully by abstracting out different services from your core app and making them available through some connection protocol (Thrift, a message queue, http, whatever) you can more easily scale up and scale out systems that become hotspots, and consumers of that service will never be any the wiser, since they only interact with it through the published API.

Looking at it another way, when I talk to the GitHub API, I have no idea what sort of database I'm ultimately talking to. Their interface abstracts away all of that complexity. Each request they receive from me might be handled by a different machine for all I know. It makes no difference.

By itself SOA doesn't do anything, but it frees you up to do other things that allow you to scale, like read-write separation and event-driven or fan-out caching. One of the most important things SOA does is quickly stop you from doing dirty hacks that work right now but are killers for future work. It's encapsulation at the application level.

Service orientation allows specialization of architectural components. If you want to build on scale, it is easier to replicate specialized components than a generalized one. It also benefits from better fault tolerance.

Additionally, the service layer forms an interface between the client and the service provider, which makes it easier to scale.

2 questions: 1) What do they do with the Pinterest-generated ID? Do they store it with each row in addition to the local ID? 2) Why randomly select a shard? Shouldn't you do this based on DB size across all boxes, or at least based on physical space left across all boxes?

We decompose 64-bit IDs into shard_id, type and local_id. We connect to the shard, go to the table, and read the local_id.

We randomly select a shard for a new user only. Of course we capacity-plan this, open new shards, and even out users eventually.
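The decomposition described here can be sketched with plain bit shifts. The thread only states that the shard ID occupies the first 16 bits of the 64-bit ID; the 10/36 split between type and local ID below is an assumed layout, not confirmed in this thread.

```python
# Sketch of packing/unpacking a 64-bit ID into (shard_id, type, local_id).
# Shard ID in the top 16 bits per the thread; the 10-bit type and 36-bit
# local ID widths are assumptions for illustration.

SHARD_BITS, TYPE_BITS, LOCAL_BITS = 16, 10, 36

def make_id(shard_id, type_id, local_id):
    """Pack the three fields into one 64-bit ID."""
    return (shard_id << (TYPE_BITS + LOCAL_BITS)) | (type_id << LOCAL_BITS) | local_id

def split_id(full_id):
    """Unpack a 64-bit ID back into (shard_id, type_id, local_id)."""
    shard_id = (full_id >> (TYPE_BITS + LOCAL_BITS)) & ((1 << SHARD_BITS) - 1)
    type_id = (full_id >> LOCAL_BITS) & ((1 << TYPE_BITS) - 1)
    local_id = full_id & ((1 << LOCAL_BITS) - 1)
    return shard_id, type_id, local_id

print(split_id(make_id(3, 1, 12345)))  # → (3, 1, 12345)
```

The nice property is that any service holding an ID can route the query itself: the shard ID falls out of a shift, with no lookup service in the hot path.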

Oops, 1 is directly mentioned in the talk.

What is most interesting to me is that in many ways they have reinvented Couchbase. I think the only reason they didn't go with that technology was that the financial cost at their scale was too high.

I wonder if their dislike for Cassandra is based on previous versions pre-2.0. From my perspective of looking at it as it stands now, it's pretty compelling.

Not if you want strong consistency. Cassandra's performance suffers in comparison with the likes of MongoDB or Couchbase when reading with strong consistency, since the clients have no idea of the server topology.

Umm, what? Cassandra is just as fast or faster (depending on configuration and load) compared to MongoDB with consistent reads/writes. Definitely with writes, but reads get tricky.

Firstly, these sorts of applications are always going to be read-heavy, so the reads are more important. Secondly, Cassandra cannot and will not be as good as something like Couchbase, since its client libraries are not aware of the server topology and so cannot make a direct connection to the server hosting the data. This means that, depending on your consistency requirements, Cassandra will range from merely okay to occasionally terrible, depending on whether you care about the 99th percentile. This behaviour was one of the reasons my company moved away from Cassandra.

This is probably the best benchmark of Cassandra/MongoDb/Couchbase http://www.slideshare.net/renatko/couchbase-performance-benc...

This benchmark is pure marketing. Cassandra clients can be token aware and can do connections directly to the right node.

If you want a scientific, independent, peer-reviewed NoSQL benchmark, read this: http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf

Cassandra is a clear winner here.

No, the benchmark is not pure marketing. Why would you claim that it is? Apart from Astyanax, which clients are token aware?

That paper is very useful so thanks for posting the link but it has a number of issues as I see it.

1) It considers Cassandra, Redis, VoltDB, Voldemort, HBase and MySQL. It does not cover either MongoDB or Couchbase.

2) Latency values are given as average and do not show p95/p99. In my experience, Cassandra in particular is susceptible to high latency at these values.

3) Even considering average values, the read latency of Cassandra is higher than you would see with either MongoDB or Couchbase.

4) Cassandra does not deal well with ephemeral data. There are issues while GC'ing large number of tombstones for example that will hurt a long running system.

The long and short of it is that Cassandra is a fantastic system for write heavy situations. What it is not good at are read heavy situations where deterministic low latency is required, which is pretty much what the pinterest guys were dealing with.

It is marketing, because Couchbase is a featured customer of Altoros, the company that did the benchmark. The rule of thumb is: never trust a benchmark done by someone related to one of the benchmarked systems. Obviously they would not have published it if Couchbase had lost; they would have had to be insane to.

Another reason it is marketing is that it lacks essential information on the setup of each benchmarked system. E.g. for Cassandra, I don't even know which version they used, what the replication factor was, what consistency level they read at, or whether they enabled the row cache (which decreases latency a lot). Cassandra has improved read throughput and latency by a huge factor since version 0.6 and is constantly improving, so the version really matters.

First, let me concede that Cassandra has had a storied history of terrible read performance. However, if the last time anyone looked at Cassandra for read performance was 0.8 or used size-tiered compaction, I'd encourage them to take another look.

The p95 latency issues were largely caused by GC pressure from having a large amount of relatively static data on-heap. In 1.2, the two largest of these - bloom filters and compression data - were moved off-heap. It's my experience that with 1.2, most of the p95 latency is now caused by network and/or disk latency, as it should be.

I'm not going to compare it with other data stores in this comment, but I'd encourage people to consider that Cassandra is designed for durable persistence and larger-than-RAM datasets.

As for #4, this is mostly false. Tombstones (markers for deleted rows/columns) CAN cause issues with read performance, but "issues while GC'ing large number of tombstones" is a bit of a hand-wavy statement. Poor performance results from tombstone pile-up when you have rows where columns are constantly inserted and then removed before GC grace (10 days by default). Tombstones sit around until GC grace, so effectively consider any data you insert to live for at least 10 days, unless of course you do something about it.

Usually people just tune GC grace, as the default is extremely conservative. It's also much better to use row-level deletes if possible. If the data is time-ordered and needs to be trimmed, a row-level delete with the timestamp of the trim point can improve performance dramatically. This is because a row-level tombstone will cause reads to skip any SSTables with max_timestamp < the tombstone's. It also means compaction will quickly obsolete any superseded row-level tombstones.

Here's a graph of P99 latency as observed from the application for wide row reads (involving ~60 columns on average, CL.ONE) from a real 12-node hi1.4xlarge Cassandra 1.2.3 cluster running across 3 EC2 availability zones. The p99 RTTs between these hosts is ~2ms.


This also happens to be on data that is "ephemeral" as our goal is to keep it bounded at ~100 columns. The read:write ratio is about even. It has a mix of row and column-level deletes, LeveledCompactionStrategy, and the standard 10 day GC grace.

Cassandra is a real winner in that study only if you need on the order of 100K ops/sec. Otherwise, high latency can be a killer.

You probably mean pre 1.2 as 2.0 isn't there yet.

DataStax did call it 2.0 and now 3.0

DataStax Enterprise 2.0 shipped with Cassandra 1.0. DataStax Enterprise 3.0 shipped with Cassandra 1.1.

DataStax Enterprise is a different product than Cassandra. Cassandra is one of its components, but there are more things bundled, e.g. Solr and Hadoop.

Great talk - inspired me to simplify the tech stack.

Pyres looks like a good alternative for Celery in the Python world when you don't need Celery's cron-like features.

Very cool. I thought Pinterest came out of nowhere and is now everywhere, but I didn't realize how fast they did it and how large they've grown.

thanks for posting!

Does someone know what programming language and framework they use? Still Python+Django? PyPy?

From what I've read, it's Python and "heavily modified" Django. I recall one of their engineers saying that, if they were to do it again, they would've gone with a more lightweight Python framework because of how many heavy modifications they made.

I have read that too (multiple times, actually), and the statement never made sense to me: of course in hindsight you would start with something like Flask, because you ran into all of the scaling problems with Django, but there would be no way to fix those problems before they occurred - and if you did try to fix them beforehand, it would be premature optimization. It seems like an example of what Taleb calls the "Narrative Fallacy".


What happens if one shard contains a whole slew of large companies (Macy's, Gap, etc.), or over time the users on one shard grow to the point that they cannot be contained within one physical machine? Are they just assuming this is very unlikely?

No, not at all. We already have this problem. Right now we can take the DB hit, but we will eventually attack this problem from the cache perspective. We will have replicated caches for this scenario. In the worst case, if a shard is overloaded, we stop creating new users in it and put it on a dedicated physical machine. I don't think we will let the situation get there.

Great article, thanks for sharing.

More specifics about the decision to dump MongoDB would be extremely useful.

A simply awesome article that I'll be referring back to again and again. Can someone refine it with their own comments as well? It would be really helpful to point out any mistakes Pinterest made at any point - what I want to know is what they could have done better. Also, business is a dynamic environment, so some things may have changed; if something was right when Pinterest did it but better alternatives exist now, please point that out too. Thanks a lot in advance. I'll be reading many of the comments properly.

