Friday, October 19, 2012

One big cluster: How CloudFlare launched 10 data centers in 30 days

The inside of Equinix's co-location facility in San Jose—the home of CloudFlare's primary data center.

On August 22, CloudFlare, a content delivery network, turned on a brand new data center in Seoul, Korea—the last of ten new facilities started across four continents in a span of thirty days. The Seoul data center brought CloudFlare's number of data centers up to 23, nearly doubling the company's global reach—a significant feat in itself for a company of just 32 employees.

But there was something else significant about the Seoul data center and the other nine facilities set up this summer: despite the fact that the company owned every router and every server in their racks, and each had been configured with great care to handle the demands of CloudFlare's CDN and security services, no one from CloudFlare had ever set foot in them. All that came from CloudFlare directly was a six-page manual instructing facility managers and local suppliers on how to rack and plug in the boxes shipped to them.

"We have nobody stationed in Stockholm or Seoul or Sydney, or a lot of the places that we put these new data centers," CloudFlare CEO Matthew Prince told Ars. "In fact, no CloudFlare employees have stepped foot in half of the facilities where we've launched." The totally remote-controlled data center approach used by the company is one of the reasons that CloudFlare can afford to provide its services for free to most of its customers—and still make a 75 percent profit margin.

In the two years since its launch, the content delivery network and denial-of-service protection company has helped keep all sorts of sites online during global attacks, both famous and infamous, and has earned recognition from both Davos and LulzSec along the way. All that attention has amounted to Yahoo-sized traffic: the CloudFlare service has handled over 581 billion pageviews since its launch.

Yet CloudFlare does all this without the sort of Domain Name Service "black magic" that Akamai and other content delivery networks use to forward-position content—and with only 32 employees. To reach that level of efficiency, CloudFlare has done some black magic of a different sort, relying on open-source software from the realm of high-performance computing, storage tricks from the world of "big data," a bit of network peering arbitrage and clever use of a core Internet routing technology.

In the process, it has created an ever-expanding army of remote-controlled service points around the globe that can eat 60-gigabit-per-second distributed denial of service attacks for breakfast.

Routing with Anycast

CloudFlare's CDN is based on Anycast, a routing technique that relies on the Border Gateway Protocol (BGP), the routing protocol at the center of how the Internet directs traffic. Anycast builds on the way BGP supports the multi-homing of IP addresses, in which multiple routers connect a network to the Internet: each router advertises the IP addresses reachable through it, and other routers use those advertisements to determine the shortest path for network traffic to take to reach that destination.

Using Anycast means that CloudFlare makes the servers it fronts appear to be in many places, while only using one IP address. "If you do a traceroute to Metallica.com (a CloudFlare customer), depending on where you are in the world, you would hit a different data center," Prince said. "But you're getting back the same IP address."

That means that as CloudFlare adds more data centers, and those data centers advertise the IP addresses of the websites that are fronted by the service, the Internet's core routers automatically re-map the routes to the IP addresses of the sites. There's no need to do anything special with the Domain Name Service to handle load-balancing of network traffic to sites other than point the hostname for a site at CloudFlare's IP address. It also means that when a specific data center needs to be taken down for an upgrade or maintenance (or gets knocked offline for some other reason), the routes can be adjusted on the fly.
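As a rough illustration of the routing behavior described above, the following Python sketch models several data centers advertising the same address, with each client landing at whichever advertising site is the fewest hops away. The site names, hop counts, address, and the pick_site helper are all invented for illustration, and real BGP path selection weighs many more attributes than a simple hop count.

```python
# Toy model of anycast: every site advertises the same IP address, and
# each client is routed to the advertising site with the shortest path.
# Hop counts and site names are made up for illustration.

ANYCAST_IP = "203.0.113.10"  # documentation-range address, not a real CloudFlare IP

# hops[client][site] = number of network hops from client to site
hops = {
    "client-berlin": {"frankfurt": 2, "san-jose": 9, "seoul": 11},
    "client-tokyo": {"frankfurt": 10, "san-jose": 7, "seoul": 3},
}


def pick_site(client, advertising):
    """Return the advertising site closest to the client, or None."""
    candidates = {s: h for s, h in hops[client].items() if s in advertising}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


advertising = {"frankfurt", "san-jose", "seoul"}
before = pick_site("client-tokyo", advertising)

# Take Seoul down for maintenance: it stops advertising the address,
# and traffic shifts to the next-closest site with no DNS change.
advertising.discard("seoul")
after = pick_site("client-tokyo", advertising)

print(ANYCAST_IP, before, "->", after)
```

In this toy model the Tokyo client moves from Seoul to San Jose once Seoul withdraws its advertisement, while the address the client connects to never changes.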

That makes it much harder for distributed denial of service attacks to go after servers behind CloudFlare's CDN; if the attacking machines are geographically widespread, the traffic they generate gets spread across all of CloudFlare's data centers, as long as the network connections at each site aren't overwhelmed.

In September, Prince said, "there was a brand new botnet out there launching big attacks, and it targeted one of our customers. It generated 65 gigabits per second of traffic hitting our network. But none of that traffic was focused in one place—it was split fairly evenly across our 23 data centers, so each of those facilities only had to deal with about 3 gigs of traffic. That's much more manageable."
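A quick check of the arithmetic behind Prince's "about 3 gigs" figure, assuming a perfectly even split across sites (he says only "fairly evenly," so the real per-site load would have varied):

```python
attack_gbps = 65    # total attack traffic quoted by Prince
data_centers = 23   # CloudFlare data centers at the time

# Even split of the attack across every data center.
per_site_gbps = attack_gbps / data_centers
print(f"{per_site_gbps:.2f} Gbps per data center")  # prints "2.83 Gbps per data center"
```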

Net-rich, power-poor

Making CloudFlare's approach work requires that it put its networks as close as possible to the core routers of the Internet—at least in terms of network hops. While companies like Google, Facebook, Microsoft, and Yahoo have gone to great lengths to build their own custom data centers in places where power is cheap and where they can take advantage of the economies of scale, CloudFlare looks to use existing facilities that "your network traffic would be passing through even if you weren't using our service," Prince said.

As a result, the company's "data centers" are usually at most a few racks of hardware, installed at co-location facilities that are major network exchange points. Prince said that most of his company's data centers are set up at Equinix IBX co-location facilities in the US, including CloudFlare's primary facility in San Jose—a facility also used by Google and other major cloud players as an on-ramp to the Internet.

CloudFlare looks for co-location facilities with the same sort of capabilities wherever it can. But these sorts of facilities tend to be older, without the kind of power distribution density that a custom-built data center would have. "That means that to get as much compute power as possible into any given rack, we're spending a lot of time paying attention to what power decisions we make," Prince said.

The other factor driving what goes into those racks is the need to maximize the utilization of CloudFlare's outbound Internet connections. CloudFlare buys its bandwidth wholesale from network transit providers, committing to a certain level of service. "We're paying for that no matter what," Prince said, "so it's optimal to fill that pipe up."

That means that the computing power of CloudFlare's servers is less of a priority than networking and cache input/output and power consumption. And since CloudFlare depends heavily on the facility providers overseas or other partners to do hardware installations and swap-outs, the company needed to make its servers as simple as possible to install—bringing it down to that six-page manual. To make that possible, CloudFlare's engineering team drew on experience and technology from the high-performance computing world.

The magical pixie-booted data center

"A lot of our team comes from the HPC space," Prince said. "They include people who built HPC networks for the Department of Energy, where they have an 80 thousand node cluster, and had to figure out how to get 80,000 computers, fit them into one space, cable them in a really reliable way, and make sure that you can manage them from a single location."

One of the things that CloudFlare brought over from the team's DoE experience was the Perceus Provisioning System, an open-source provisioning system for Linux used by DoE for its HPC environments. All of CloudFlare's servers are "pixie-booted" (using a Preboot eXecution Environment, or PXE) across a virtual private network between data centers; servers are delivered with no operating system or configuration whatsoever, other than a bootloader that calls back to Perceus for provisioning. "The servers come from whatever equipment vendor we buy them from completely bare," Prince said. "All we get from them is the MAC address."

CloudFlare's servers run on a custom-built Linux distribution based on Debian. For security purposes, the servers are "statelessly" provisioned with Perceus—that is, the operating system is loaded completely in RAM. The mass storage on CloudFlare servers (which is universally based on SSD drives) is used exclusively for caching data from clients' sites.

The gear that does get significant pre-installation attention from CloudFlare's engineers is the routers, primarily supplied by Juniper Networks, which works with CloudFlare to preconfigure them before they are shipped to new data centers. Part of the configuration creates virtual network connections over the Internet to the other CloudFlare data centers, which allows each data center to pull software from its nearest peer during provisioning and updating.

"When we booted up Vienna, for example," said Prince, "the closest data center was Frankfurt, so we used the Frankfurt facility to boot the new Vienna facility." One server in Vienna was booted first as the "master node," with provisioning instructions for each of the other machines. Once the servers are all provisioned and loaded, "they call back to our central facility (in San Jose) and say, 'Here are our MAC addresses, what do you need us to do?'"

Once the machines have passed a final set of tests, each gets designated with an operational responsibility: acting as a proxy for Web requests to clients' servers, managing the cache of content to speed responses, or providing DNS and logging services. Each of those services can run on any server in the stack, and any server can step up to take over a service if one of its comrades fails.

Caching and hashing

Caching is part of the job for every server in each CloudFlare facility, and being able to scale up the size of the cache is another reason for the modular nature of how the company thinks of servers. Rather than storing cached webpage objects in a traditional database or file system, CloudFlare uses a hash-based database that works in a fashion similar to "NoSQL" databases like 10gen's MongoDB and Amazon's Dynamo storage system.

When a request for a webpage comes in for the first time, CloudFlare retrieves the site contents. A consistent hashing algorithm in CloudFlare's caching engine then converts the URL used to call each element into a value, which is used as the key under which the content is stored locally at each data center. Each server in the stack is assigned a range of hashes to store content for, and subsequent requests for the content are routed to the appropriate server for that cache.
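The article does not describe CloudFlare's actual hashing scheme, so the following is only a minimal Python sketch of the general idea: a URL is hashed to a number, and each server owns a range of hash values on a ring. The MD5 hash, the 32-bit key space, the server names, and the HashRing class are all assumptions chosen for illustration.

```python
import bisect
import hashlib


def url_hash(text: str) -> int:
    """Map a string to a 32-bit integer (MD5 chosen only for illustration)."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest()[:4], "big")


class HashRing:
    """Each server owns the range of hashes ending at its point on the ring."""

    def __init__(self, servers):
        self._points = sorted((url_hash(name), name) for name in servers)
        self._keys = [point for point, _ in self._points]

    def server_for(self, url: str) -> str:
        # First server point at or after the URL's hash, wrapping around.
        i = bisect.bisect_left(self._keys, url_hash(url)) % len(self._points)
        return self._points[i][1]


ring = HashRing(["cache-01", "cache-02", "cache-03"])
url = "http://example.com/logo.png"
owner = ring.server_for(url)
print(url_hash(url), owner)
```

The appeal of a ring layout like this is that removing one server only moves the content that server held; every other URL keeps mapping to the same machine, so the rest of the cache stays warm.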

Unlike most database applications, the cache stored at each CloudFlare facility has an undefined expiration date—and because of the nature of those facilities, it isn't a simple matter to add more storage. To keep the utilization level of installed storage high, the cache system simply purges older cache data when it needs to store new content.
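The purge-when-full behavior can be sketched as a fixed-size cache that drops its oldest entries to make room. This is an illustration only: the article says just that "older" data is purged, so the oldest-inserted-first policy, the BoundedCache class, and the byte-count capacity here are assumptions rather than CloudFlare's actual design.

```python
from collections import OrderedDict


class BoundedCache:
    """Fixed-capacity cache that drops its oldest entries when full."""

    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.used = 0
        self._items = OrderedDict()  # key -> content, oldest first

    def put(self, key: str, content: bytes) -> None:
        if len(content) > self.capacity:
            return  # too large to cache at all
        if key in self._items:
            # Replacing an entry: release its old space first.
            self.used -= len(self._items.pop(key))
        # Purge oldest entries until the new content fits.
        while self.used + len(content) > self.capacity:
            _, old = self._items.popitem(last=False)
            self.used -= len(old)
        self._items[key] = content
        self.used += len(content)

    def get(self, key: str):
        return self._items.get(key)


cache = BoundedCache(capacity_bytes=10)
cache.put("a", b"12345")
cache.put("b", b"12345")
cache.put("c", b"123")  # no room left, so "a" (the oldest) is purged
print(cache.get("a"), cache.get("b"), cache.get("c"))
```

Because nothing is ever expired on a schedule, the store stays close to full, which matches the goal of keeping installed storage highly utilized.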

The downside of the hash-based cache's simplicity is that it has no built-in logging system to track content. CloudFlare can't tell customers which data centers have copies of which content they've posted. "A customer will ask me, 'Tell me all of the files you have in cache,'" Prince said. "For us, all we know is there are a whole bunch of hashes sitting on a disk somewhere—we don't keep track of which object belongs to what site."

The upside, however, is that the system has a very low overhead as a result, and can retrieve site content quickly and keep those outbound pipes full. And when you're scaling a 32-person company to fight the speed of light worldwide, it helps to keep things as simple as possible.

