Sprint Outage — So What?

Sprint (AS1239) had a pretty big outage yesterday. It took out voice and data services to a big chunk of the Southwest and California. The problem was that Sprint was doing maintenance on one part of their SONET ring and took a failure on the other part. That happens sometimes (hopefully not very often). So Sprint took some heat for it, and rightfully so.

sprintoutages.png

This is a graph of the unreachable networks during the period. The sharp spike on the left is the Sprint event, from about 20:30 UTC (15:30 EST) to about 23:30 UTC (18:30 EST). From the scale, we can see that about 300 networks were affected. Sharp rise, sharp fall. Definitely a specific event that impacted the affected networks. But the event probably raises more questions than it answers: Why are there even more outages later that night and into the next day? Is an outage that affects 300 networks a big deal or a non-event?

When I write "300 networks were outaged," the term "network" means "slot in the global routing table." That's not very specific. It could refer to just a couple of computers in a /24 (up to 254 computers). Or it could refer to millions of computers in a /8 (up to about 16 million computers). On average, it's really somewhere in the middle (up to a few thousand computers). Still, 300 of those dropping off the Internet all at once sounds like quite a lot. And it probably is.
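To make the prefix arithmetic concrete, here is a minimal sketch using Python's standard `ipaddress` module. The function name `usable_hosts` and the two example prefixes are illustrative choices, not anything taken from the outage data.

```python
import ipaddress

def usable_hosts(prefix: str) -> int:
    """Return the number of usable host addresses in an IPv4 prefix."""
    net = ipaddress.ip_network(prefix)
    # /31 and /32 have no separate network/broadcast addresses to subtract.
    if net.prefixlen >= 31:
        return net.num_addresses
    # Otherwise the network and broadcast addresses are not assignable.
    return net.num_addresses - 2

# Example prefixes chosen only for illustration.
for prefix in ("192.0.2.0/24", "10.0.0.0/8"):
    print(prefix, usable_hosts(prefix))
```

A /24 works out to 254 usable addresses and a /8 to 16,777,214, which is the range a single "network" in the routing table can span.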

But look at the sharp (higher) peak at 08:00 UTC (03:00 EST). It goes above 1100 networks outaged, and the net increase over a short period of time (less than an hour) is about 300 networks (although the rise is nowhere near as sharp as the Sprint outage). The outages during that period appear to be caused by events on AS702 (UUNet Europe/Middle East) and AS9808 (Guangdong Mobile Communication Co.Ltd. in China). So that's a large carrier and a medium-sized carrier, each with some collection of network events, affecting a decent-sized number of people. Not surprising at all.
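The "net increase over a short period" reading of the graph can be sketched as a small calculation. The sample values below are made up for illustration (they are not the real data behind the graph), and `max_rise` is a hypothetical helper name.

```python
def max_rise(samples, window_minutes):
    """Largest increase in count between any two samples at most
    window_minutes apart. samples: (minute, count) pairs sorted by minute."""
    best = 0
    for i, (t0, c0) in enumerate(samples):
        for t1, c1 in samples[i + 1:]:
            if t1 - t0 > window_minutes:
                break  # later samples are even further away in time
            best = max(best, c1 - c0)
    return best

# Hypothetical counts of unreachable networks, sampled every 15 minutes.
samples = [(0, 800), (15, 820), (30, 900), (45, 1000), (60, 1110), (75, 1050)]
print(max_rise(samples, 60))
```

With these invented numbers the largest one-hour rise is 310 networks, roughly the size of jump described above. The point is that the height of a peak and the size of the jump that produced it are separate measurements.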

Even in the context of the one-day graph, it seems obvious that an outage of 300 networks isn't that serious. What about in the context of the week graph?

outagesweek.png

Yep, there you go. You can see the little narrow Sprint event about two-thirds of the way across. But it's nothing compared to the two big events that took place over the weekend. Those lasted almost 24 hours each and affected closer to 800 networks. Clearly, events that take out a few hundred networks are not uncommon at all. The network giveth. The network taketh away.

And obviously, the density of meaningfulness on the Internet varies widely. Sprint loses 300 networks and most of the USA feels the tremors. 800 or 1000 networks flicker in and out of existence somewhere else, and even most network operators sleep blissfully through it.

There's something else to be said about this outage affecting land-line voice, mobile voice, data circuits, and Internet services from Sprint. Clearly these services are converged onto the same physical network. But this is not the much-ballyhooed convergence that analysts keep promising. Or if it is, it turns out that SONET (not IP) is the actual convergence layer. That's funny.

Comments

Thanks for the informative perspective, Todd.

This article prompts me to (re)consider things like GMPLS and other 'layer-1 reconfiguration' systems. Even though GMPLS is, at present, little more than speculative vendor interest coupled with a handful of desperate providers, I think things are far enough along to see that we could end up in an even worse situation if we allow new physical paths to be created dynamically out of a shared resource, particularly when the company that owns that shared resource has no incentive to avoid significantly overselling its capacity or to disclose what that capacity actually is.

In my experience, no DR company ever expects a significant majority of its clients to show up on the doorstep of the DR site at the same time. However, the Chicago Tunnel flood incident is here to remind us that overselling emergency services in an especially profound way is not a good idea, whatever the salespeople's interests.

I predict we'll see the same 'learning' occur within GMPLS providers when the time comes.