Hotmail Systems Engineering

There's a good interview at ACM with Phil Smoot, an engineer on the Hotmail project and a product manager for MSN. The interview attempts to address issues of operations and systems scaling on an Internet-scale service and as such is interesting to me. It's also full of silly platitudes: comparing Hotmail to the Everest of "megaservices" even though it is several orders of magnitude smaller than competing services and applications like Google Search or Yahoo! Search.

What I found interesting about the article was how few specifics Smoot was willing to give up about how you scale an "Internet megaservice", and how low the ratio of machines to sysadmins is. They have 10K machines and O(100) sysadmins. That's a ratio (you can do the math with me here!) of 100 machines per sysadmin. I claim that to be total crap. I can do 100 machines per sysadmin with almost no automation or fancy management tooling. 50-100 is just what you can do with good server software, reasonable hardware, and nothing particularly fancy. Hell, you can do 50 machines per sysadmin without even using something like cfengine or puppet.
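For anyone who does want to do the math, here is a minimal sketch in Python. The two input figures are the ones quoted above; the function name and the alternative headcounts in the loop are my own illustrative assumptions, since O(100) is an order of magnitude and not an exact number.

```python
# Figures as quoted in the post: "10K machines and O(100) sysadmins".
MACHINES = 10_000
SYSADMINS = 100  # O(100) is an order of magnitude, not an exact headcount


def machines_per_sysadmin(machines: int, sysadmins: int) -> float:
    """Return how many machines each sysadmin is responsible for."""
    return machines / sysadmins


ratio = machines_per_sysadmin(MACHINES, SYSADMINS)
print(f"{ratio:.0f} machines per sysadmin")

# Hypothetical sensitivity check: these headcounts are assumptions,
# chosen only to show how far the ratio moves if "O(100)" is loose.
for headcount in (50, 100, 300):
    per_admin = machines_per_sysadmin(MACHINES, headcount)
    print(f"{headcount} sysadmins: {per_admin:.0f} machines each")
```

Even if the real headcount is off by a factor of two or three in either direction, the ratio stays within the same order of magnitude.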

Not to start a religious crusade here, but one wonders if they are using a lot of Windows to do this and if that is why their machines-per-sysadmin ratio is so low. Windows has been shown to be significantly more management-intensive than Unix and Unix derivatives (in part because of the lack of a command-line interface and the lack of a text-based exposure of the machine's configuration). This doesn't make Windows worse, necessarily, but it tends to make it less suitable for applications that require massive horizontal scaling. Google and Akamai don't use Windows and there are many good, non-price-related reasons for that. OK, so I lied. This does make Windows worse.

But the comments about scaling, and about the flexibility necessary to scale a compute- and storage-dense service, resonated with me. These are problems we struggle with at Renesys. Some of them have known, good solutions. Most of them have only trade-offs: faster, cheaper, easier to manage, but not all at once.

It's an interesting interview and worth reading, even though the content is a bit thin.

About this Entry

This page contains a single entry by Todd Underwood published on January 18, 2006 4:16 PM.
