Since the major outage to Panix and others caused this past weekend by Con Edison Communications, a number of people have been asking what the root cause was. That is to say, what were the circumstances underlying Con Edison Communications' error in announcing the networks that they announced? In the intervening days, we have learned something about what happened, and there is room to reflect on what it all means for the future stability of the Internet.
Several people, including current and former Panix admins, have indicated that Panix was a former customer of Con Edison, and that this might explain why Con Ed had Panix routes in their RADB as-27506-transit object. Checking Renesys's records of routing data going back to Jan 1, 2002, I see no evidence of Con Edison Communications (AS27506) and Panix (AS2033) being adjacent to each other in any announcement from any of our peers at any time since then. So I can't really verify that Panix was ever a Con Ed Comm customer. Can anyone clear this up? So far, it's not making sense, or at least it's not adding up to a full picture.
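For readers who want to run the same kind of check against their own archive of routing data, here is a minimal sketch of an adjacency test. It is not the tooling Renesys uses. The function names and the sample paths are made up for illustration, and the upstream ASNs 64500 and 64501 come from the range reserved for documentation, so nothing here is a real observation.

```python
from typing import Iterable, List


def adjacent_in_path(as_path: List[int], a: int, b: int) -> bool:
    """Return True if ASNs a and b sit next to each other in as_path."""
    # Collapse AS-path prepending (the same ASN repeated) before comparing neighbours.
    collapsed = [asn for i, asn in enumerate(as_path)
                 if i == 0 or asn != as_path[i - 1]]
    return any({x, y} == {a, b} for x, y in zip(collapsed, collapsed[1:]))


def ever_adjacent(paths: Iterable[List[int]], a: int, b: int) -> bool:
    """Return True if any path in the archive shows a and b as neighbours."""
    return any(adjacent_in_path(path, a, b) for path in paths)


# Invented sample paths, NOT real routing data.
sample_paths = [
    [64500, 27506, 27506, 7169],  # AS27506 (prepended) next to AS7169
    [64501, 2033],                # AS2033 seen, but not next to AS27506
]

print(ever_adjacent(sample_paths, 27506, 7169))  # neighbours in the first path
print(ever_adjacent(sample_paths, 27506, 2033))  # never neighbours in this sample
```

A customer relationship would normally leave at least some paths in which the two ASNs are neighbours, which is why an empty result over several years of data is hard to square with the former-customer theory.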
The supposition was that all of the other affected ASes that are not currently customers of Con Ed Comm are former customers. Some appear to be: Walrus Internet (AS7169), Advanced Digital Internet (AS23011), and NYFIX (AS20282) were certainly customers of Con Edison in the recent past. Others do not appear to have been. So this theory of former customer status doesn't appear to be the full explanation.
But this isn't really the "root cause" that Steven Bellovin was asking for above on the NANOG mailing list. This is really a proximate cause. The root cause or "ultimate cause" is that filtering is imperfect and frequently out of date. This case is particularly interesting and painful because Verio (NTT America, AS2914) are known for building good filters automatically. In fact, they are somewhat notorious for being rigid (and effective) in their automated filtering. So it is particularly distressing that they were implicated in this event in any way at all. In this case, Verio built their filters out of dated, incorrect information. Con Edison had added the Panix routes to the RADB route registry as routes that they were allowed to announce. Verio believed the registry and built their filters accordingly.
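To make the failure mode concrete, here is a rough sketch of how a filter built automatically from registry data inherits whatever stale entries the registry holds. It is not Verio's actual process. The as-set name, the ASNs, and the prefixes are all hypothetical (taken from the ranges reserved for documentation), and the registry is reduced to two in-memory tables.

```python
import ipaddress

# Hypothetical registry contents, NOT real RADB data.
# An as-set lists the ASNs a provider says it carries traffic for.
AS_SETS = {
    "AS-EXAMPLE-TRANSIT": {64496, 64497},  # 64497 left long ago but was never removed
}

# Route objects map a prefix to the ASN registered as its origin.
ROUTE_OBJECTS = [
    ("192.0.2.0/24", 64496),     # current customer
    ("198.51.100.0/24", 64497),  # former customer, stale as-set membership
    ("203.0.113.0/24", 64499),   # unrelated network, not in the as-set
]


def build_prefix_filter(as_set_name):
    """Collect every registered prefix whose origin is a member of the as-set."""
    members = AS_SETS[as_set_name]
    return {ipaddress.ip_network(prefix)
            for prefix, origin in ROUTE_OBJECTS
            if origin in members}


def accepts(prefix_filter, announced_prefix):
    """Exact-match check of an announced prefix against the generated filter."""
    return ipaddress.ip_network(announced_prefix) in prefix_filter


customer_filter = build_prefix_filter("AS-EXAMPLE-TRANSIT")
print(accepts(customer_filter, "198.51.100.0/24"))  # allowed because of the stale entry
print(accepts(customer_filter, "203.0.113.0/24"))   # rejected, never registered
```

The generation step is doing exactly what it was asked to do. The filter is only as good as the registry data behind it, and nothing in this pipeline notices that a membership has gone stale.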
Normally in cases of leaks like this, the propagation is via some provider or peer who doesn't filter at all. In this case, one of the vectors was one of the most responsible filterers on the net. Sigh.
So in terms of engineering good solutions, the space is pretty crowded. One camp is of the "total solution" variety that involves new hardware (probably), new protocols (definitely), and a Public Key approach where either originations (in the case of soBGP) or any announcements (in the case of sBGP) are signed and verified. This is obviously a very good and fairly complete approach to the problem but it's also obviously seeing precious little adoption. The soBGP and sBGP IETF drafts all appear to have expired, which is disconcerting. Both have been around for several years and neither can point to a single large-scale adoption. And in the meantime we have nothing.
Another set of approaches has been to look at alternate methods of building filters, taking into account more information about the history of routing announcements and dampening or refusing to accept novel, questionable announcements for some fixed, short amount of time. There's interesting work from Josh Karlin (along with Stephanie Forrest, who taught me an intro programming class back at UNM, and Jennifer Rexford) that suggests a way to build filters that penalize novel, suspicious routing announcements for some period of time while not impeding good, normal routes. This was also part of the work that Tom Scholl, Jim Deleskie and I presented at the last NANOG. All of these strategies have the disadvantage of being partial solutions, the advantage of being implementable easily and in stages without a network forklift or a protocol upgrade, but the further disadvantage of being nowhere near fully baked. It's unclear what it would take to finish the cooking process, but I'm excited to see what arises.
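The general idea of treating novel origins with suspicion for a while can be sketched in a few lines. This is a simplified illustration only, not the algorithm from Karlin, Forrest and Rexford's paper or from the NANOG presentation. The 24-hour window, the class name, and the method names are assumptions, and the prefixes and ASNs are documentation values.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

SUSPICION_WINDOW = 24 * 3600  # seconds; an assumed value, not taken from any paper


@dataclass
class OriginHistory:
    """Remembers when each (prefix, origin ASN) pair was first observed."""
    first_seen: Dict[Tuple[str, int], float] = field(default_factory=dict)

    def observe(self, prefix: str, origin: int, now: float) -> None:
        # Keep only the earliest sighting of each pair.
        self.first_seen.setdefault((prefix, origin), now)

    def is_suspicious(self, prefix: str, origin: int, now: float) -> bool:
        """A novel origin is suspicious while an established origin exists."""
        self.observe(prefix, origin, now)
        if now - self.first_seen[(prefix, origin)] >= SUSPICION_WINDOW:
            return False  # this origin has been around long enough to be trusted
        # Novel origin: suspicious only if some other origin is already established.
        return any(p == prefix and o != origin and now - t >= SUSPICION_WINDOW
                   for (p, o), t in self.first_seen.items())


history = OriginHistory()
history.observe("192.0.2.0/24", 64496, now=0)  # long-standing origin

later = 100_000  # a little over a day afterwards
print(history.is_suspicious("192.0.2.0/24", 64497, later))  # new origin, old one established
print(history.is_suspicious("192.0.2.0/24", 64496, later))  # the established origin itself
```

A router using something like this would prefer the established route while the newcomer is in its suspicion window, which gives operators time to notice a hijack without permanently blocking legitimate changes of origin.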
Clearly more people need to keep searching for good solutions to this set of problems. Extra credit for solutions that can be implemented by individual autonomous systems without hardware upgrades or major protocol changes, but that may not be possible.
And in the meantime, routing on the Internet is more or less wide open and minor-scale disasters happen on a regular basis. There's a talk at NANOG 36 showing that route hijacking is dramatically more common than most people think. Good luck out there.
Update 2006-01-26 20:13 EST: Updated link to Josh Karlin's paper so that it works now. Also, related discussion rages on the NANOG mailing list, so I strongly recommend bopping over there for those who are interested.



Comments
I'd say Verio is known to build filters, not that they are known to build 'good filters'. Their filters are wholly dependent upon their customers putting truthful, complete and accurate information in their IRR. If a customer decided to:
1) not clean up
2) not be truthful
3) not be complete
Verio would still build their policy automatically and without any sort of human/quality check...
So, just a nit really :) Thanks.
Posted by: Chris | January 30, 2006 11:41 PM
It's a nit, for sure, Chris, but it's an important one, and an issue I probably glossed over. The hardest part about building good filters is having good data. In the past, the belief was that there would be a 'well-run IRR' that would somehow magically contain this data. This appears to just not be in the cards.
Lots of people have been spending lots of time to figure out how to do filtering in the absence of such data. Clearly, there's a lot more work left to be done.
Posted by: Todd Underwood | February 2, 2006 07:52 AM