Cal Henderson (Flickr) - Scalable Web Architectures: Common Patterns and Approaches
"This is a slightly nerdy talk."
Flickr - "Large scale kitten-sharing" - Awesome
In a single second Flickr serves 40,000 photos, 100,000 cache operations, and 130,000 database queries.
What is scalability? Not raw speed/performance, not high availability, not technology X or technology Y. Scalability is: traffic growth, dataset growth, maintainability.
Two kinds of scale: Vertical (get bigger), Horizontal (get more) [Cal's definitions are the opposite of tradition.] Sometimes buying big hardware is right: buying a box is quicker and easier than rewriting software. Running out of MySQL performance? Spend months on data federation OR just buy a ton more RAM. Vertical scaling isn't interesting because it just involves paying.
What is an architecture? The way the bits fit together. What grows where. Trade-offs between good/fast/cheap. We'll be talking LAMP stack today. "LAMP is the best one."
Application servers scale in two ways: Really well and quite badly. Cal is a proponent of the first approach. It all comes down to sessions and state.
- Local sessions == bad (when they move == quite bad)
- Stored on disk is really bad. "Almost as bad as storing on tape drives."
- Stored in memory better - but all still really bad because it's not fault tolerant, can't avoid hot spots, can't move users.
- Centralized sessions == good
- Store in central database, pushes the issue down the stack
- No sessions at all == awesome!
- Store session information in a cookie - some people are like "ahh cookies evil" haha
- Stash the user id and name and sign it with a timestamp (easy to expire and validate)
- Super-slim sessions
- Store more interesting information - only in the cookie - base 64
- Need more than the cookie then pull from DB or an account cache
- None of the drawbacks of sessions and avoids overhead of query-per-page
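The signed-cookie idea above is easy to sketch: stash the user id and name plus a timestamp, sign it server-side, and validate/expire on the way back in. A minimal sketch (the secret, field names, and one-hour lifetime are my assumptions, not from the talk):

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"server-side-secret"  # hypothetical key; never sent to the client

def make_cookie(user_id, name):
    """Serialize user id/name with a timestamp, then sign it."""
    payload = base64.b64encode(
        json.dumps({"id": user_id, "name": name, "ts": int(time.time())}).encode()
    ).decode()
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return payload + "." + sig

def read_cookie(cookie, max_age=3600):
    """Check the signature and the timestamp; return the payload or None."""
    payload, _, sig = cookie.rpartition(".")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None  # tampered with
    data = json.loads(base64.b64decode(payload))
    if time.time() - data["ts"] > max_age:
        return None  # expired
    return data
```

Because the timestamp is inside the signed payload, expiry and validation both happen without touching the database, which is the whole point of "no sessions at all."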
The Rasmus (PHP inventor) Way
- App Server is 'shared nothing'
- Push problems down the stack
- Sweet diagram featuring a trifle
By using pairs of load balancers and switches we can gain high availability over a single load balancer. Then replicate to datacenters in remote locations - and we can keep adding more app servers.
So scaling the web app is easy - scaling the data store is hard - the talk will be covering the latter.
Amazon - all services scale horizontally - cheap when small but not cheap with large enough economies of scale.
3. Load Balancing
Hardware or software (Layer 4 or 7 on the OSI stack) balancing.
Hardware load balancing is the approach nearly everyone took and is expensive. Often need a pair communicating with heartbeats. High performance is easy to get, > 1 Gbps. Leading brands: Alteon, Cisco, NetScaler, Foundry. "They'll market you and tell you they're awesome, they're not that awesome, they all do the same thing."
Software load balancing is such a simple task that you can run it on existing boxes - i.e. with only 2 machines, each can run both a load balancer and a web server and they can balance each other. Hard to get high availability with software. Lots of options: Pound is really simple, Perlbal, mod_proxy (Apache processes are big though), Wackamole with mod_backhand. Wackamole is gaining popularity and is "kind of awesome." Machines get virtual IPs. When a box dies it migrates its virtual IPs to boxes which are still alive. Neat. "Awesome. Free."
Layer 7 balancing - we can look at the URL, compute a hash on it, and send you to a particular server. Very useful for serving things out of a cache.
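The layer-7 trick is just "same URL, same server": a sketch, with a made-up pool of cache hosts (MD5 as the hash is my choice, not specified in the talk):

```python
import hashlib

CACHE_SERVERS = ["cache1", "cache2", "cache3"]  # hypothetical pool

def pick_server(url):
    """Hash the URL so the same resource always lands on the same cache box."""
    h = int(hashlib.md5(url.encode()).hexdigest(), 16)
    return CACHE_SERVERS[h % len(CACHE_SERVERS)]
```

Any given URL deterministically maps to one box, so each box only has to cache its own slice of the URL space.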
OPTIONS command in HTTP - "Hey I know you're a webserver, what are you capable of?" No one uses it but load balancers often do.
4. Queuing
Parallelizable == easy! If we can transcode/crawl in parallel, it's easy - but the web ain't built for slow things.
Asynchronous queuing involves sending a request to the web server and getting a reply immediately saying "yep, we got it"; the actual work happens later.
Twitter - "It's like a web based service for telling people what you had for lunch." Twitter queues. We don't care if everyone gets it immediately - we just kind of hope/assume it happens. Flickr does the image resizing asynchronously. Lets you know when your tasks are done with multiple message passes.
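The ack-now, work-later pattern can be sketched with an in-process queue (Flickr's actual queue infrastructure isn't described; the worker here just records ids instead of really resizing):

```python
import queue
import threading

jobs = queue.Queue()

def handle_upload(photo_id):
    """Web request path: acknowledge immediately, enqueue the slow work."""
    jobs.put(photo_id)
    return "202 Accepted"  # 'yep, we got it'

def resize_worker(done):
    """Background path: pull jobs off the queue, off the request path."""
    while True:
        photo_id = jobs.get()
        if photo_id is None:  # shutdown sentinel
            break
        done.append(photo_id)  # stand-in for the actual image resize
        jobs.task_done()
```

The request returns in microseconds regardless of how slow the resize is; notification that the task finished is a separate message pass.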
5. Relational Data
Database is the hardest part to scale. If we can, best to avoid the issue altogether and just buy bigger hardware. Dual-quad opteron with 16GB RAM can get you a long way.
What do you do when the hardware can't keep up? We do a lot more reading than writing. Somewhere between 80/20 or 90/10 ratio of read/write. "Only a few people who like to caption pictures of kittens on the internet - but a whole lot of people who like to look at them."
Replication is the solution for the read write problem. Master-slave replication. Add more slaves, 1 master 3 slaves = 4 times the read power. Flickr is 6 reads for every write. Problem is what happens when we need to push more writes? As we need more writes this doesn't scale well - have to add a lot more boxes.
6. Caching
Caching avoids needing to scale, or makes it cheaper. mod_perl / shared memory is simple but invalidation is hard. The MySQL query cache has bad performance in most cases: for any read query it stores a pointer to the result, reused if you perform the exact same query. Any write flushes it, though. So if you do 10 reads for every 1 write you're unlikely to get a hit before the cache is invalidated. If you only write once an hour or day then this is a really good thing. Otherwise turn it off.
Write-through cache - sits between the app server and the database: the app server writes to the cache, and the cache writes to the database.
Write-back cache - write directly to the cache. Sometime later the cache writes to the database. "We can add redundancy and make it fault-tolerant but congratulations you've turned it into a database."
Sideline cache - In general, in the LAMP stack, we typically do sideline caching. We handle the caching logic in our own applications. We perform the write directly to the database from the app server and then we massage the cache with application specific logic. In theory easy to implement - the difficult bit is invalidating the cache. Memcached from Danga/LiveJournal is what most are using. Yahoo, Facebook, Digg, Flickr, lots of people using it.
Where do we store the cache? Use balancing. Using layer 4, good: cache can be local on the machine, bad: invalidation is hard with high node counts and space is wasted by duplicate objects. More people are using layer 7 and inspecting URLs, good because no wasted space and scales linearly, bad because we need multiple remote locations when lots of things are cached in different places. Can be avoided with proxy but this gets complicated. Talking about hashrings being a good thing - someone has done it for memcached.
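The hashring idea mentioned at the end is consistent hashing: nodes are placed at many points on a ring, and a key goes to the next node clockwise, so adding or removing a node only remaps its own arc. A minimal sketch (100 virtual points per node is an arbitrary choice):

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring: adding a node only remaps the keys
    that fall in its arcs, not the whole keyspace."""

    def __init__(self, nodes, replicas=100):
        self.ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes for i in range(replicas)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        i = bisect.bisect(self.keys, self._hash(key)) % len(self.keys)
        return self.ring[i][1]
```

Compare with plain `hash(key) % n`: there, changing n remaps nearly every key, which is why high node counts made invalidation hard in the layer-4 scheme.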
7. High Availability
What do you do when a master goes down? It used to be that people freaked out, haha. "You're not going to hit 5 9's by freaking out for a day." The key to HA is to identify single points of failure - "eliminate single points that will fuck up everything haha". Use multiple points instead of single points. Some stuff is hard to solve, so if you push the problem up the tree it gets easier. "If a processor dies should we figure it out and recover? No, just have two machines." Master-master replication is becoming trendy. Make sure your IDs are unique in MySQL or bad things happen. Hot/warm: writes go from one master to the other; hot/hot: writes go from either to the other. Hot/hot has to deal with collisions. No auto-increment for hot/hot until MySQL 5 (and even then you can't rely on ordering). "Sucky thing about replication is replication lag." With master-master we can put conflicting writes on either side, replicated at the same time, and wind up with two columns forever out of sync. Design your schema and access patterns to avoid collisions. Hash users to servers. Master-master is really just a small ring. Bigger rings are possible (nobody does this with master rings; if one goes down things get bad). Each slave may only have a single master.
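The unique-IDs point is usually handled with MySQL's `auto_increment_increment` / `auto_increment_offset` settings, so two masters interleave odd and even keys and never collide. A sketch of hypothetical my.cnf fragments:

```ini
# master A
auto_increment_increment = 2
auto_increment_offset    = 1   # generates 1, 3, 5, ...

# master B
auto_increment_increment = 2
auto_increment_offset    = 2   # generates 2, 4, 6, ...
```

This avoids ID collisions but not logical conflicts - two masters can still update the same row, which is why the schema and access patterns have to avoid collisions.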
N + M where N = nodes needed to run the system, M = nodes we can afford to lose. "Having M as big as N starts to suck." NDB (MySQL) allows a mesh. Supports for replication out to slaves is on its way "we're a year closer to that being the truth."
8. Data Federation
At some point you need more writes, and this is tough. Each cluster has limited write capacity; the trick is to add more clusters. Vertical partitioning divides tables that don't need to be joined - you never need to join kittens and preferences. But you still run into limits: tables become too large and you run out of groups. Data federation is about splitting up large tables, organized by a primary object, usually by users. Have one central point for look-ups to find which cluster an object resides on: "Users 1,2,3,4,5 are stored on cluster 1." Flickr has 70-some chunks/clusters. Over time you need to be able to migrate resources between shards (hard with 'lockable' objects). Flickr has a locking mechanism to prevent you from accessing your account while they are migrating it somewhere.
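The central look-up can be sketched as a simple map from user to cluster; the least-loaded placement policy and cluster names here are my assumptions (the talk only says look-ups are centralized):

```python
shard_map = {}  # user_id -> cluster name: the one central look-up table

def assign_user(user_id, clusters):
    """Place a new user on the least-loaded cluster (simplest policy)."""
    loads = {c: 0 for c in clusters}
    for c in shard_map.values():
        loads[c] += 1
    target = min(clusters, key=lambda c: loads[c])
    shard_map[user_id] = target
    return target

def cluster_for(user_id):
    """Every query starts here: one central hop, then straight to the shard."""
    return shard_map[user_id]
```

In practice the map itself lives in a small, heavily cached central database; migrating a user is then "copy the rows, flip this one entry" (plus the locking the talk mentions).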
Wordpress.com approach: hash users into one of n buckets, and put all the buckets on one server. When you run out of capacity, split the buckets across twice as many servers, and repeat the doubling each time you run out again. No global look-up needed - but you have to migrate large chunks at once and go down for an hour. "Interesting approach, not sure of anyone else who has taken it."
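A sketch of that bucket scheme (the bucket count of 4096 and the modulo striping are my assumptions about how the splitting works):

```python
N_BUCKETS = 4096  # fixed bucket count, chosen up front (hypothetical)

def bucket_for(user_id):
    """A user hashes to the same bucket forever."""
    return user_id % N_BUCKETS

def server_for(user_id, n_servers):
    """Buckets are striped across however many servers exist today.
    Doubling n_servers moves half the buckets - no per-user look-up table."""
    return bucket_for(user_id) % n_servers
```

The trade-off versus a central look-up: no extra hop on every query, but a capacity split moves whole buckets at once, hence the downtime.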
Upsides of federation: heterogeneous hardware is fine, just allocate larger buckets to bigger boxes. Opportunity to put paying users on bigger and faster hardware (LiveJournal did this). Downsides are that you have to keep data in the right place and application logic gets a lot more complicated. Management of more clusters is a pain with backups and maintenance. More database connections are needed per page; a proxy can help but it's complicated. "Facebook looks at you and your social network and tries to move you and all of your friends onto the same shard. Advanced stuff. I don't think anyone other than Facebook tries to do it." The big problem is how to split the data in a sensible way. Dual table issue: avoid "walking the shards" - "if comments are kept on the same shard as the picture, how do we get all comments for a user?" "If you have data which can be organized for more than one purpose (comments by photo/user) then have two tables, one for each view, and deal with it in app logic. Perform two writes. If we modify - two writes. If we delete - two deletes." At Flickr they don't do distributed transactions - they do a transaction on each shard and try to close them at the same time. Worst case scenario is one wouldn't commit and you would be out of sync. In reality it's ok; things can get a bit out of sync, so they repair things on the fly and have administrative tools which repair tables that go out of sync. "Consistency is a problem." Interesting strategy: keep one normalized copy of the data and create copies which are not normalized.
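The "two tables, one per view" idea looks like this in miniature - dicts stand in for the two denormalized tables, one keyed the way each access path needs:

```python
# Two denormalized views of the same comments, one per access path.
comments_by_photo = {}  # photo_id -> list of comments (lives on photo's shard)
comments_by_user = {}   # user_id  -> list of comments (lives on user's shard)

def add_comment(photo_id, user_id, text):
    """One logical write becomes two physical writes, one per view."""
    comment = {"photo": photo_id, "user": user_id, "text": text}
    comments_by_photo.setdefault(photo_id, []).append(comment)
    comments_by_user.setdefault(user_id, []).append(comment)

def delete_comment(photo_id, user_id, text):
    """Deletes must hit both views too, or they drift out of sync."""
    comments_by_photo[photo_id] = [
        c for c in comments_by_photo[photo_id] if c["text"] != text]
    comments_by_user[user_id] = [
        c for c in comments_by_user[user_id] if c["text"] != text]
```

If the second write fails, the views are out of sync - which is exactly the consistency problem the repair tools exist for.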
Data federation is how large applications are scaled. Period. Hard but not impossible. Good software design makes it easier. Master-master makes for High Availability within shards. Master-master trees work for central cluster.
9. Multi-site HA
Having multiple datacenters is really hard. Hot/warm with MySQL slaved setup but manual reconfig on failure. "Facebook serves pages out of two datacenters, only one where they do all the writes. WordPress does that. Bunch do it. Not HA if the hot datacenter fails." Hot/hot with master-master is dangerous, each site has a single point of failure. Nobody really does this. Hot/hot with sync/async manual replication is a tough, big engineering task.
Global server load balancing. Easiest solution is AkaDNS-like service. Akamai manages your domain and Akamai figures out which IP to give based on either performance (minimize latency) or balance (send 50% here and 50% here) concerns. If you only have 2 data centers then each of them needs to have the capacity to support the other. Cheaper to have more, smaller datacenters than fewer, bigger.
10. Serving Files
Virtual versioning. The database indicates version 3 of a file, the browser requests example.com/foo_3.jpg, the request comes through the cache and is cached with the versioned URL, and mod_rewrite converts the versioned URL back to the path on disk. So you never serve old versions. Flickr does this with CSS.
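Both halves of the versioning round-trip fit in a few lines; the URL shape matches the foo_3.jpg example above, but the exact rewrite rule is my guess at what mod_rewrite would do:

```python
import re

def versioned_url(name, version):
    """What the page emits: the current version baked into the filename."""
    return f"http://example.com/{name}_{version}.jpg"

def rewrite(path):
    """What the rewrite rule does on the way in: strip the version suffix
    back off to find the one real file on disk (hypothetical rule)."""
    return re.sub(r"_(\d+)\.jpg$", ".jpg", path)
```

Bumping the version in the database changes every emitted URL, so caches see a brand-new resource and the stale copy is simply never requested again.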
Reverse proxy choices: layer 7 load balancers, i.e. squid, mod_proxy, mod_cache, HA proxy, Squid & Carp, Varnish (better than squid if working set is smaller than memory, "ruby kids love it, trendy").
What if you authenticate? Proxies with a token. Perlbal re-proxying can do redirection magic after verifying user credentials, perlbal goes and picks up the file from somewhere else. User doesn't find out how to access the file directly. Permission URLs are another way: bake the auth into the URL and it saves the auth step. Do this with tokens in the URLs which can be checked without having to go to the database. Self-signed hash. Magic translation from a hex string to a path. At Flickr they don't have to access the datastore to determine permission. "When you look at a Flickr JPG URL we can turn that into a path on disk without any additional info." Downsides to permission urls: gives you permission for life and you can't choose later to expire without storing it in a database which defeats the purpose. Can set a maximum lifetime. This is what Amazon S3 does. Upsides are it works and scales very nicely. "It would more than double the traffic to our database if we had to check permissions for everything."
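A permission URL with a maximum lifetime can be sketched like this (the query-string shape, secret, and 16-hex-char token are my assumptions; Flickr's actual scheme is a path translation, not described in detail):

```python
import hashlib
import hmac
import time

SECRET = b"url-signing-key"  # hypothetical server-side secret

def sign_url(path, expires_in=86400):
    """Bake auth into the URL: an expiry plus an HMAC over path and expiry."""
    expires = int(time.time()) + expires_in
    token = hmac.new(SECRET, f"{path}:{expires}".encode(),
                     hashlib.sha256).hexdigest()[:16]
    return f"{path}?expires={expires}&token={token}"

def check_url(path, expires, token):
    """The edge server validates with a hash - never touching the database."""
    if int(expires) < time.time():
        return False  # past the maximum lifetime
    expected = hmac.new(SECRET, f"{path}:{expires}".encode(),
                        hashlib.sha256).hexdigest()[:16]
    return hmac.compare_digest(token, expected)
```

The expiry is inside the signed string, so it can't be extended by the client; revoking a single URL early, though, still requires state somewhere, which is the downside the talk notes.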
11. Storing Files
Storing files is easy: get a big disk, then a bigger disk. Horizontal scaling again comes to the rescue. NFS: stateful, sucks. SMB / CIFS / Samba: turn off MSRPC & WINS (NetBIOS NS); stateful but degrades gracefully. HTTP: stateless (great!), just use Apache. RAID is good: RAID 5 is cheap, RAID 10 is fast. Disks die all the time. Move the problem up the stack again - if you have data colocated, do you really need RAID? RAID rebuilds itself, machine replication doesn't. Big appliances self-heal, and so does MogileFS. Dives into the Google File System at a high level. MogileFS - OMG Files - developed by Danga/SixApart, open source, designed to scale for many small files. MogileFS looks interesting, need to read more about it. S3 is very cheap until you get to a certain massive scale.
The Flickr file system, which is proprietary to Flickr. FSFS - doesn't make sense, don't know what the first S stands for. Many petabytes of storage over 6 data centers. Doesn't have metadata storage; the app deals with that itself - metadata is stored by the app, just a virtual volume number the app keeps track of. The app chooses a path. Multiple 'storagemaster' nodes. Virtual nodes are mirrored in at least 2 colos at a time. Reading is done directly from nodes; writing happens through storagemaster nodes. Reading and writing scale separately, by adding more storagemasters for writing or more squids for reading.
12. Field Work
Flickr architecture with a sweet slide. LiveJournal with a sweet slide. Pictures to follow. Same architecture used more or less with Digg, WordPress, Facebook, etc. Not going to serve files? Chop off the left. Theo Schlossnagle's Scalable Internet Architectures book is highly recommended.
Cal Henderson describes himself as a "Flickr architect, PHP programmer, author and chronic complainer."
His wikipedia bio: Cal is best known for co-owning and developing the online creative community B3ta with Denise Wilton and Rob Manuel; being the chief software architect for the photo-sharing application Flickr (originally working for Ludicorp and then Yahoo) and writing the book Building Scalable Web Sites for O'Reilly Media. He's also worked for EMAP and is responsible for writing City Creator among many other websites, services and desktop applications.