Panel Discussion - Building in the Clouds: Scaling Web 2.0

Panel: Jason Hoffman (Joyent), Alistair Croll (Bitcurrent), Alex Barnett (From Bungee Labs to Intuit), Dwight Merriman (10Gen), Jinesh Varia (AWS), Pete Koomen (Google)

Panel session driven by Q&A.

Q) How do you decide between a component-centric cloud and a service-centric cloud? In a component-centric cloud I need to add instances to my app cluster (e.g. AWS), and in a service-centric cloud I write for a specific framework that scales itself (e.g. App Engine). When does it make sense to focus on each?

A) Hoffman: I think they’ve already converged. It depends on the situation, and you do both. Web app tiering has long been dead. You’re already siloing your assets. People are going to look at a given piece of functionality in their site and ask, what’s the service behind it?

Koomen: App Engine is designed to handle low-latency web applications.

Varia: Component clouds are great for flexibility. As the abstractions increase you lose flexibility and you also face lock-in on a technology stack.

Barnett: Scaling for what, and why? How much up-front consideration do start-ups need to put into becoming scalable? If you’ve only got a finite set of resources, how do you face it? The nature and type of the application will have fundamental implications for the underlying design.

Hoffman: Most web apps don’t have to scale in any reasonable amount of time. Another scaling issue is when you start out bigger and you don’t get enough traffic and have to scale down.

Q) Centralized vs. distributed computing: there is tension between the two. Google has been buying thousands of NetScalers, and just this morning Amazon announced the cloud delivery service.

A) Merriman: It’s an interesting fact that CDNs are one of the first forms of cloud computing. It’s an easy way to distribute content. Definitely use a CDN for static content.

Koomen: Scaling is about reducing the constant factor. It has to do with minimizing the amount of work you’re doing on the central server, whether you move that work to the CDN or to the client side. It’s about a mentality of reducing what you’re doing on every request.

Hoffman: Amazon was smart about coming out with S3 before coming out with EC2. If you’re dealing with datasets less than a terabyte in size.

Varia: We have been listening a lot. From a scalability perspective, many people needed data closer to their customers. Amazon is opening a CDN on three continents where static data from S3 will be available with lower latencies and higher data transfer rates. Customers running RIAs feel it is key to serve content faster.

Q) How much can the edge help?

Hoffman: Outside of serving static content like images the edge doesn’t do anything.

Merriman: I don’t know that I agree with that, for example if you’re serving data to users in Japan.

Hoffman: WAN optimization and network optimization are quite different from edge caching.

Q) How do you measure capacity and performance? What are the metrics you look at?

Koomen: Google cares a lot about CPU and latency. We can scale disk easily.

Hoffman: I think that’s the opposite end. There are things that take up space or move space: disk space, CPU space, and network space. Then there’s the moving to and from these things. Most people in the real world are not coding against the CPU or CPU-bound in a web app. Nobody’s writing web apps that saturate the bandwidth that comes out of a single server. It takes a long time to fill up a terabyte. What people need is memory and better disk I/O. People still use relational databases. Disk I/O is the main thing.

Barnett: We also worry a lot about the end-user experience. We’ve instrumented the AJAX library coming down to track every mouse click and interaction that an application has at a very granular level. You’re able to measure every single click to within milliseconds, along with the latency on web service calls.

Hoffman: There doesn’t currently exist tooling to take end-user experience and feed that all the way back to capacity planning.

Merriman: We had to serve 10 – 20 billion ads per day. There’s a lot of CPU involved in picking which ads to serve. Other issue was just the database. “Have you seen this ad before? How many times?” Lots of data you access in real time and on the back-end on event processing. We looked at CPU a lot and I/O utilization on the database servers.

Varia: At Amazon, metrics are key, from the individual developer, to the business, to our whole organization. For a developer we measure in byte-hours, which capture how much data that person is storing and how it grows. For S3 we measure the number of objects stored (22 billion) and the number of transactions. We peak at 50,000 transactions per second. We stay ahead of the curve. On the business side we need to understand our segmentation of large, medium, and small businesses.
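As an aside to the storage metric Varia describes, here is a minimal sketch of how a storage-over-time figure could be computed, assuming it means bytes stored multiplied by the hours they are stored. The function name, the sample format, and the numbers are hypothetical and are not taken from AWS billing.

```python
def byte_hours(samples):
    """Integrate stored bytes over time.

    samples: list of (hour, bytes_stored) pairs sorted by hour.
    Each storage level is assumed to hold until the next sample.
    """
    total = 0
    for (t0, size), (t1, _) in zip(samples, samples[1:]):
        total += size * (t1 - t0)
    return total


# Hypothetical usage: 1 GB held for 10 hours, then 3 GB for 14 hours.
GB = 10**9
usage = [(0, 1 * GB), (10, 3 * GB), (24, 3 * GB)]
total = byte_hours(usage)
print(total / GB)  # GB-hours: 1*10 + 3*14 = 52.0
```

Tracking the same figure across billing periods would show how a developer's stored data grows, which is the trend the panelist says Amazon watches.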

Barnett: It’s interesting that when we charge for services on a utility model we

Koomen: We’re not going to be able to prevent people from taking out cloud services if they write bad code. So it’s important for us to be able to figure out where the problems exist and bubble that up to the user so they’re not making bad decisions.

Q) How do you guys deal with one rogue app?

Varia: Animoto is a very cool Web 2.0 application where you upload your photos and music tracks and it renders them into a really cool video. They created a Facebook app and went from 25,000 users total to adding 25,000 users every hour. They scaled from 50 servers to 5,000 servers in 2 days. They were able to do this because they are built on a cloud platform. They scaled down during the night to save on cost. Some of these applications are bursty, no doubt. On an aggregate level the curve is pretty smooth. Amazon takes tremendous pride in figuring out how to add servers and services.

We have certain limits which prevent developers from starting 1000 instances. You are capped at 20 initially. If you want more you have to talk to us. There are security and safety mechanisms in place. If a business has a valid business case we’ll flip the switch.
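The interaction between demand-driven scale-out and the instance cap mentioned above can be sketched roughly as follows. Only the initial cap of 20 comes from the panel; the function name, the per-instance throughput figure, and the sizing rule are assumptions made for illustration.

```python
import math

DEFAULT_INSTANCE_CAP = 20  # initial cap quoted on the panel


def instances_to_run(requests_per_sec, per_instance_rps, cap=DEFAULT_INSTANCE_CAP):
    """Return (granted, over_cap) for a hypothetical sizing rule.

    per_instance_rps is an assumed throughput for one instance.
    over_cap is True when demand exceeds the cap, i.e. when the
    developer would need to ask the provider to raise the limit.
    """
    needed = max(1, math.ceil(requests_per_sec / per_instance_rps))
    granted = min(needed, cap)
    return granted, needed > cap


print(instances_to_run(500, 100))               # (5, False)
print(instances_to_run(10_000, 100))            # (20, True)
print(instances_to_run(10_000, 100, cap=5000))  # (100, False)
```

The second call shows the case Varia describes: the cap holds the fleet at 20 until a raised limit is agreed, after which the same demand is served in full.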

Hoffman: If you have to spin up new virtual machines to handle traffic bursts you’re going to miss the burst.

Koomen: We deal with bursts like that by dealing with every request agnostically. As for what we do on our side to prevent users from exploiting the system, we’ve got quotas that measure what individual applications can consume, and some knobs to turn.
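A per-application quota of the kind Koomen mentions might look roughly like the sketch below. This is not App Engine's actual mechanism; the class name, the single shared limit, and the unit of consumption are all hypothetical.

```python
class AppQuota:
    """Track resource units consumed per application against a fixed limit."""

    def __init__(self, limit):
        self.limit = limit  # the "knob": maximum units any one app may consume
        self.used = {}      # app_id -> units consumed so far

    def try_consume(self, app_id, units=1):
        """Record the usage and return True, or return False if it would exceed the limit."""
        current = self.used.get(app_id, 0)
        if current + units > self.limit:
            return False
        self.used[app_id] = current + units
        return True


quota = AppQuota(limit=3)
print(quota.try_consume("rogue-app", 2))  # True
print(quota.try_consume("rogue-app", 2))  # False: would exceed the limit
print(quota.try_consume("other-app", 2))  # True: other apps are unaffected
```

Because usage is tracked per application, one rogue app exhausting its quota does not reduce what other applications can consume.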