Internet-Wide Catastrophe—Last Year
One year ago today TTNet in Turkey (AS9121) pretended to be the entire Internet. And unfortunately for the rest of the Internet, many large network providers believed them (or at least believed them in part). As far as anyone knows, it was a mistake, not a malicious act. But the consequences were far from benign: for several hours a large number of Internet users were unable to reach a large number of Internet sites. Twelve months later we can take a look at what happened, and whether we’ve learned much in the intervening time.
Early Christmas Eve morning 2004, TTNet (AS9121) started announcing what appeared to be a full table (well over 100,000 entries) of Internet routes to all of their transit providers. I was on call that Christmas (as I am this Christmas; I’m sensing a bad pattern here). So around 4:30 in the morning US Eastern Standard Time, I started getting paged.
Renesys collects routing information from over one hundred peering sessions all over the planet. Two things happened to that infrastructure that caused me to get paged. The first is that we started receiving updates (routing information) from our peers at a sharply higher rate. This is usually an indication of something seriously wrong with some part of the Internet, and this kind of large-scale instability puts a lot of stress on Renesys’s data collection, processing and serving infrastructure. The second is that parts of our distributed monitoring infrastructure were suddenly unable to reach other parts of it.
We wrote a report for NANOG earlier this year describing this event in some detail, so I won’t go into the gory details here. But the basics, for those of you who didn’t hear about it, are pretty simple: TTNet pretended to have the best path to everything on the Internet. Telecom Italia Seabone (AS6762), unfortunately, believed that most of those paths were the best paths and suddenly shifted all of their traffic from where it had previously been going (Amazon, Microsoft, Yahoo, CNN, BBC, etc.) to TTNet. Most other large networks believed the routes as well, though to a lesser degree.
So for a large number of Internet users, some chunks of the Internet were unreachable for at least a few hours on the morning of December 24 last year. Among those places without complete reachability was the cable modem service at my partner’s family’s house, where we were staying that night. This made it considerably more difficult for me to troubleshoot and understand the problem. Virtually everything on the Internet was unreachable for someone: banks, governments, ecommerce sites, businesses, universities. No one escaped the damage.
Those of you following along at home might wonder: How can this happen? How can someone, through malice or error, pretend to be the whole Internet? This is pretty much how Internet routing works: networks announce routes, and their peers either accept and propagate them or do not. The ugly, not-so-secret secret of large Internet networks is that virtually all of them blindly accept everything they hear from other large Internet networks. In other words, they transitively trust all of their peers to each correctly watch over their own customers. Most of them do most of the time, but mistakes are common and costly. If you think that sounds like a bad system, you’d be right: it is.
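For the curious, here is roughly what not blindly accepting everything could look like, as a minimal Python sketch. It is not how any particular router or provider implements filtering. The function name accept_routes, the allow-list and the max_prefixes limit are all hypothetical, and the prefixes in the usage note are reserved documentation ranges.

```python
import ipaddress


def accept_routes(announced, allowed, max_prefixes):
    """Return the announced prefixes that pass a simple inbound filter.

    announced    -- prefix strings heard from one neighbor
    allowed      -- prefix strings that neighbor is expected to announce
                    (a hypothetical allow-list for this sketch)
    max_prefixes -- hypothetical per-session limit; a neighbor that sends
                    more than this is treated as leaking, and nothing
                    from it is accepted
    """
    if len(announced) > max_prefixes:
        return []

    allowed_nets = [ipaddress.ip_network(p) for p in allowed]
    accepted = []
    for prefix in announced:
        net = ipaddress.ip_network(prefix)
        # Accept only prefixes equal to, or contained in, an allowed prefix.
        # The version check comes first because subnet_of() raises
        # TypeError when an IPv4 network is compared with an IPv6 one.
        if any(net.version == a.version and net.subnet_of(a)
               for a in allowed_nets):
            accepted.append(prefix)
    return accepted
```

With an allow-list of just 192.0.2.0/24 and a limit of 10, a neighbor announcing 192.0.2.0/25 and 198.51.100.0/24 would have only the first accepted. A neighbor suddenly announcing a full table would blow through the limit and have nothing accepted at all, which is the behavior that would have contained an event like last year’s.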
More or less constantly, network providers are chiding each other for routes propagated in error. Back in September, Telefonica (AS12956) got in trouble with many providers for accepting a network, 12.0.0.0/8, from their customer (AS26210, AES Comunicaciones from Bolivia). This network is normally advertised and owned by AT&T (AS7018). But Telefonica is not the only network to have committed such a grievous error. In fact, Telefonica isn’t even the only network to have erroneously pretended to route traffic to this network (12.0.0.0/8) in the month of September! Ncore, AS12676, also claimed to have that same network. Bad news. But obviously a common occurrence.
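Catching this kind of error comes down to comparing the origin of an announcement (the last AS in its path) against who is expected to originate that address space. The Python sketch below illustrates the idea under stated assumptions: the EXPECTED_ORIGINS table and the is_suspect_origin function are hypothetical, the table holds a single entry for AT&T (AS7018) with 12.0.0.0/8 as the example prefix, and AS64500 in the usage note is a reserved documentation AS number standing in for an arbitrary upstream.

```python
import ipaddress

# Hypothetical table of expected origin ASes, keyed by covering prefix.
EXPECTED_ORIGINS = {
    ipaddress.ip_network("12.0.0.0/8"): {7018},
}


def is_suspect_origin(prefix, as_path, expected=EXPECTED_ORIGINS):
    """Return True if the origin AS is not an expected one for the prefix.

    prefix  -- announced prefix as a string
    as_path -- list of AS numbers, nearest neighbor first, origin last
    Prefixes not covered by any table entry are not flagged, since this
    sketch has no information about them.
    """
    net = ipaddress.ip_network(prefix)
    origin = as_path[-1]
    for known, origins in expected.items():
        # Version check first: subnet_of() raises TypeError on a
        # mixed IPv4/IPv6 comparison.
        if net.version == known.version and net.subnet_of(known):
            return origin not in origins
    return False
```

Here is_suspect_origin("12.0.0.0/8", [12956, 26210]) flags the announcement, since the path ends in AS26210 rather than AS7018, while is_suspect_origin("12.0.0.0/8", [64500, 7018]) does not. The hard part in practice is not the check but keeping a table like this accurate for the whole Internet.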
So you may also wonder: has this been fixed yet? The unfortunate answer is: not even a little bit. People who run large networks are, on the whole, still incredibly resistant to filtering the routes that they accept from other large networks. Jim Deleskie of Teleglobe, Tom Scholl of SBC (AT&T now) and I did a presentation about some of the reasons for this. Suffice it to say, there are many valid reasons for large networks to be in the woeful state of disrepair that they are right now. That doesn’t make it any less woeful, though.
The Internet works. But those who work close to the middle of it may marvel on an ongoing basis that it works at all, much less as well as it does. In this way, the Internet models much of the rest of industrial society: it teeters as close as it can to the precipice, veering away from collapse only when it truly needs to, and only when enough of us look over the edge and decide we don’t really want to fall. Here’s to another year of not quite falling.