House of Cards

Time flies. Although it was over 18 months ago, it seems just like yesterday that a small Czech provider, SuproNet, caused global Internet mayhem by making a perfectly valid (but extremely long) routing announcement. Since Internet routing is trust-based, within seconds every router in the world saw this announcement and tried to pass it on. Unfortunately, due to the size of this single message, quite a few routers choked – resulting in widespread Internet instability. Today, over a year later, we were treated to a somewhat different version of the exact same story.

First, let’s review the Czech incident from February 2009. There were many positives to take away.

  • It was precipitated by an honest mistake.
  • It was an extremely unlikely event, as many stars had to be in exact alignment.
  • Most of the Internet’s core survived.
  • The response from operators was fast and efficient, with the damage largely contained within an hour.

The complete technical details can be found here.

Deja vu all over again

Fast forward to today: Friday, 27 August 2010. What do you think would happen if another large and unusual routing announcement were made on the Internet? Do you think all the router vendors have perfected their code in the past 18 months? Do you think the entire planet has upgraded to this new, improved and perfect code base? Do you think it makes sense to use the Internet as your testbed? I doubt you answered “yes” to any of these questions.

We’ll begin to describe what happened today with a snippet from a private mailing list. We’ll purposely leave out the technical details so that we don’t inadvertently contribute to the building of a Cybernuke.

On Friday 27 August, from 08:41 to 09:08 UTC, the RIPE NCC Routing Information Service (RIS) announced a route with an experimental BGP attribute. During this announcement, some Internet Service Providers reported problems with their networking infrastructure.

Immediately after discovering this, we stopped the announcement and started investigating the problem. Our investigation has shown that the problem was likely to have been caused by certain router types incorrectly modifying the experimental attribute and then further announcing the malformed route to their peers. The announcements sent out by the RIS were correct and complied to all standards.
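To make the failure mode in the note above concrete, here is a minimal Python sketch of how the BGP specification (RFC 4271) says a speaker should treat a path attribute it does not recognize: an optional transitive attribute is passed along unchanged with the Partial bit set, an optional non-transitive one is quietly dropped, and anything malformed is an error that tears down the session. This is an illustration only, not a reconstruction of the vendor bug or of the RIPE NCC announcement. The function names, the type code 250 and the 300-byte payload are assumptions invented for the example.

```python
import struct

# Attribute flag bits from RFC 4271, section 4.3.
OPTIONAL, TRANSITIVE, PARTIAL, EXT_LEN = 0x80, 0x40, 0x20, 0x10


class UpdateError(Exception):
    """An error that, under RFC 4271, ends in a NOTIFICATION and a session reset."""


def encode_attr(flags, type_code, value):
    """Serialize one path attribute, choosing a 1- or 2-byte length field."""
    if len(value) > 255:
        header = struct.pack("!BBH", flags | EXT_LEN, type_code, len(value))
    else:
        header = struct.pack("!BBB", flags & ~EXT_LEN, type_code, len(value))
    return header + value


def decode_attr(data):
    """Parse exactly one path attribute; raise UpdateError if it is malformed."""
    if len(data) < 3:
        raise UpdateError("attribute header truncated")
    flags, type_code = data[0], data[1]
    if flags & EXT_LEN:
        if len(data) < 4:
            raise UpdateError("extended length field truncated")
        (length,) = struct.unpack("!H", data[2:4])
        offset = 4
    else:
        length, offset = data[2], 3
    if len(data) - offset != length:
        raise UpdateError("declared length does not match the bytes received")
    return flags, type_code, data[offset:]


def forward_unknown(data):
    """Return the bytes to send to peers for an unrecognized attribute,
    or None if the attribute must not be propagated."""
    flags, type_code, value = decode_attr(data)
    if not flags & OPTIONAL:
        raise UpdateError("unrecognized well-known attribute")
    if not flags & TRANSITIVE:
        return None  # optional non-transitive: ignore and do not pass on
    # Optional transitive: pass the value on untouched, marking it Partial.
    return encode_attr(flags | PARTIAL, type_code, value)


if __name__ == "__main__":
    # Hypothetical experimental attribute: type code and payload are made up.
    wire = encode_attr(OPTIONAL | TRANSITIVE, 250, b"\x00" * 300)
    forwarded = forward_unknown(wire)
    print("forwarded intact:", decode_attr(forwarded)[2] == b"\x00" * 300)
    try:
        decode_attr(forwarded[:-10])  # simulate damage in transit
    except UpdateError as err:
        print("peer would reset the session:", err)
```

The last few lines show why a single router that damages an attribute while forwarding it can hurt its neighbors: the receiving peer sees a length that no longer matches the data, treats the update as malformed and drops the session, even though the original announcement was well-formed.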

While standards compliance is nice, it is foolhardy to assume that all BGP implementations are perfectly compliant, especially given recent history. Over 3,500 prefixes (announced blocks of IP addresses) became unstable at the exact moment this “experiment” started. Not surprisingly, they were located all over the world: 832 in the US, 336 in Russia, 277 in Argentina, 256 in Romania and so forth. We saw over 60 countries impacted by a “correct” announcement that “complied with all standards”. The following graph shows the timeline of the event, followed by a map of the impacted countries by prefix count. Notice that it takes a bit for the Internet to stabilize after RIPE claims to have withdrawn the announcement at 09:08 UTC.

timeline.png

Impacted Countries by Prefix Count

RIPE-country-dist.png

Conclusions

On the positive side, the incident was very brief, the damage was limited to under 2% of the Internet and the responsible parties quickly fessed up, aborting their “experiment”. On the negative side, the Internet remains a very fragile place, even if that fragility is highly localized and varies from place to place. Standards aren’t followed, code isn’t tested and people make mistakes. That’s life with any complex system and, while we can certainly do a better job, we will continue to see these types of events no matter what safeguards we might take. What puzzles me is how anyone thought it might be a good idea to test fate in this way. The end result was completely predictable.

7 comments
Martijn Bakker

Let me see if I get that note: Big Pharma = $vendor, "you" = RIPE NCC's RIS. So what you are claiming is that RIS purposefully and willingly poisoned the internet to prove a point? I think I see where your thinking went astray there. We already figured these kinds of experiments shouldn't take place on the big bad (and apparently extremely fragile) internets, and I agreed (see 1st comment, 1st sentence). And I think RIPE RIS and Duke will wholeheartedly agree with you on that one too after this incident. It was surely an eye-opener. More lab testing should be done (if resources permit) before stuff like this is attempted publicly. But, as I said before and say again, the experiment they conducted was not the main cause. It just acted as a trigger for the main cause, in this case the malformation of a well-formed attribute by a router produced by BigVendor and running software produced by BigVendor. This, in turn, led to the disconnection of BGP sessions (as a safety measure) by the routers this malformed attribute was forwarded to. It could have been anyone, and the same routers would have made the same error. The vendor is the only consistent element in this story. Why all this hate towards RIS? I have a hard time believing Renesys is, in some way (namely by pushing the idea that RIS is purposefully harming the internets), trying to make a black sheep out of RIS. Again, all comments are my own.

Editor's note: "RIPE NCC is going to be stricter about the way it runs such experiments and will give Internet operators advance warning in the future." That's a win for everyone.

Martijn Bakker

I agree where you say that the Internet is not a playground -- as I wrote in my previous comment. Again, this is not my primary worry, and the blog post is covering that topic well enough. Saying the internet is "too fragile" for these "games", though, is, in my eyes, ridiculous. First off, I don't think anyone at Renesys would like it if we compared the sincere and honest work you do with "games". The same goes for what RIPE NCC's RIS does. These people are no less than anyone, and their work isn't either. Working for one RIPE member and having worked for multiple others, I can say their work, as far as it touched me, has never been less than professional. Saying the internet is "too fragile" is exactly what my point is about. If this is true, this must change. If we look at the direct consequences, we only see 1 major vendor (who I shall not name) whose BGP implementation is causing all the damage, as a result of a _completely valid_ BGP packet. So let's start there. I bet that if you were able to look at this vendor's BGP implementation, this wouldn't be the most horrible bug in there. Blaming RIS or anyone other than this particular vendor for what happened is like blaming the person operating a light switch for causing an electrically induced fire because of faulty wiring. It was never the intention of RIS to even have a fire drill, let alone to trigger a fire. Again, the attribute in question was completely valid. If you really believe that the small meltdown is solely the fault of RIS, maybe you should think about taking a holiday. The words in this and my previous comment are mine, and mine alone.

Editor's note: If you suspected that the manufacturers of vaccines were exaggerating their marketing claims, and human society was "too fragile as a result," would you go around subway stations infecting people to prove your point? And then blame Big Pharma for the resulting chaos? There are ways to carry out these experiments without burning down the global village, and RIPE RIS has a long history as a sponsor of responsible Internet research. I'm still waiting for RIPE RIS and Duke to explain whether this experiment was in line with their institutional research policies.

Brighten Godfrey

Jim, your comment and the original post seem to assume that the RIPE/Duke research was conducted "in hopes of tickling an easily weaponizable exploit". Do you have any reason to believe that? It seems more likely that it was not the researchers' intent to trigger any bug. (The statement on RIPE's web site says "Before starting the experiment, the RIPE NCC conducted limited testing and did not encounter any issues.") Seems like the bigger problems lie in buggy software and protocol design.

AC

If everyone treats internet routing as something fragile that should never be exposed to anything remotely risky, it is going to stay very fragile. And that would mean that someone nefarious will have a much easier time finding a 'Cybernuke'.

Editor's note: And there is a significant difference between a well-planned "fire drill" and screaming "Fire!" in a crowded theater.

Jim Cowie

My worry is that incidents like this degrade trust relationships in ways that are going to reduce everyone's ability to do research and advance the state of the art in Internet measurement. There's a good reason why Renesys wrote its own BGP speaker, incapable by design of sending routes to our BGP peers. We don't send out ANY update traffic -- Ever. Not a single message. It's written into our agreement with peering partners. The Internet is too fragile for these kinds of games. If you want to play "let's see how many core routers we can crash," the place for that is in a lab, in partnership with the vendors. Sending crazy inputs to your peers over trusted BGP sessions at one of the best-connected places on the planet in hopes of tickling an easily weaponizable exploit should be a clear violation of institutional research policy. If it's not, change your policies. Whether you're an enterprise, a vendor, or a researcher, these kinds of irresponsible games should be clearly designated "out of bounds" on the public routing fabric without careful advance coordination and discussion. What did RIPE RIS gain by the element of surprise? Speaking for myself only.

Martijn Bakker

Hi, While I agree on the point that people shouldn't use the internet as their testbed, I really think we should look more to the vendors. ISPs, hosting providers and more and more companies outside the core internet business pay an arm and a leg for routing and switching iron and their support contracts. Although it is a somewhat naive train of thought, I admit, I believe we can expect a more proper implementation of the BGP protocol, especially because it's a large part of the business of the vendor(s) in question. As suggested on several message boards and mailing lists around the internet, it's like the vendor(s) in question have never heard of fuzzing. This is especially true as nowadays it's really simple for a possibly malicious group of people to launch your so-called cybernuke without being so open about it. Which brings me to my point: Instead of bluntly bashing your 'competitor' in networking tools and stats, I think we can all be thankful it was RIPE NCC's RIS. This time.

Editor's Note: Interesting argument. Hope you get plenty of "Thank You" cards.

No route to ...

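Experimental BGP attribute makes 3,500 prefixes unstable. In February 2009 there was an incident in which a Czech provider announced an extremely long AS path over BGP, causing several router implementations, Cisco's among them, to die one after another and throwing the Internet worldwide into chaos. Now a similar incident has occurred. This time, RIPE NCC advertised an experimental BGP attribute, some routers modified it incorrectly, and as a result those routers apparently announced a malformed route to their peers. ...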