Since the major outage to Panix and others caused this past weekend by Con Edison Communications, a number of people have been asking: what was the root cause? That is to say, what were the circumstances underlying Con Edison Communications's error in announcing the networks that they announced? In the intervening days, we have learned something about what happened, and there is room to reflect on what it all means for the future stability of the Internet.
January 2006 Archives
Well, not the whole Internet, but Con Edison (AS27506) "stole" several important prefixes on the Internet earlier today, probably by mistake. Earlier this afternoon, I saw a message on the NANOG mailing list claiming that Con Ed was "stealing" routes to Panix, the venerable New York ISP, who had previously been hit with another outage beyond their control. Looking quickly into this with Renesys Routing Intelligence, it's far worse than that.
Con Edison apparently spent the better part of last night and today pretending to be a fair number of other people's networks ranging from Martha Stewart Living to NYFIX, from The New York Daily News to Walrus Internet. This is bad. While some of these networks were customers of Con Edison, many were not. Did anyone else notice or care that all of their traffic was being misrouted or is Panix the only one of these people who isn't asleep at the switch? Read on for significantly more detail about what we saw happen and who was affected.
The SJ Mercury News (and lots of other people) are reporting that the US Justice Department is trying to get Google to disclose massive amounts of search index data. What's unique and troubling about this is that Justice isn't claiming that Google has done anything wrong or that the Google information directly relates to any crime: they just want to use Google's index as a way to save themselves the hassle of indexing the web for themselves.
There's a good interview at ACM with Phil Smoot, an engineer on the Hotmail project and a product manager for MSN. The interview attempts to address issues of operations and systems scaling on an Internet-scale service and as such is interesting to me. It's also full of some silly platitudes: comparing Hotmail to the Everest of "megaservices", for example, even though it is several orders of magnitude smaller than some competing services and applications like Google Search or Yahoo! Search.
Sprint (AS1239) had a pretty big outage yesterday. It took out voice and data services to a big chunk of the Southwest and California. The problem was that Sprint was doing maintenance on one part of their SONET ring and took a failure on the other part. That happens sometimes (hopefully not very often). So Sprint took some heat for it, and rightfully so.
This is a graph of the unreachable networks during the period. The sharp spike on the left is the Sprint event, from about 20:30 UTC (15:30 EST) to about 23:30 UTC (18:30 EST). From the scale, we can see that about 300 networks were affected. Sharp rise, sharp fall. Definitely a specific event that impacted the affected networks. But the event probably raises more questions than answers: Why are there even more outages later that night into the next day? Is an outage that affects 300 networks a big deal or a non-event?
A few days ago Om Malik suggested that consumers don't need any more bandwidth. He made a couple of interesting claims about the new round of speed upgrades being offered by networks in the US (and being mirrored by ADSL2+ rollouts in Europe). Does anyone really need 6 Mb/s (or 30 Mb/s) at their house, or is it just a big ruse by the communications companies to rip you off?
In an interview back in November, SBC (now AT&T) CEO Edward Whitacre started a firestorm. He implied that since SBC owned the fast pipes into people's homes, it could control the Internet access that those people received. The networking public and media immediately began worrying about a "two-tier" Internet or "partial" Internet access where SBC customers could not access some content (or not access it quickly or effectively) unless that content provider paid SBC.
The Internet has succeeded largely because of its end-to-end nature (making it possible to deploy new, interesting applications without reconfiguring the network) and because of its universality. Whitacre's comments seem to affect both principles, although it's tough to say. Since the firestorm about this is still brewing, it seems worth walking through the arguments more carefully to figure out: is Whitacre proposing a two-tier Internet, and is that a problem?
