Rob: "One problem I see is that there is simply too much Cisco in the world..." Partly true. But sometimes ubiquity can help: in this case, the prevalence of Cisco kit meant that there was an even chance that some Cisco code was involved, which turned out to be the case. In the unregulated world of the Internet, this meant that minds like Ivan P's and others could identify the cause (and solution) of an event quite quickly. gary
Longer is not always better
This post is a follow-up to our blog last week about a small Czech provider briefly causing global Internet mayhem via a single errant routing announcement. In this incident, SuproNet (AS 47868) announced its one prefix, 18.104.22.168/21, to its backup provider, Sloane Park Property Trust (AS 29113), with an extremely long AS path. We’ve gotten more feedback about this entry than any other in recent memory, so we thought we’d try to answer some of the questions that were posed both here and elsewhere, as well as provide some clarification about exactly what went on. The questions we try to address include:
- How could anyone be this dumb?
- Why did this cascade throughout the planet?
- Can you provide more details about the impact and its spread?
- How do we prevent this from happening again?
How could anyone be this dumb?
I’ll admit that this was my first thought. And since this incident interrupted my lunch, I was only too happy to join the mob. In hindsight, my reaction was due to the fact that my router experience is largely limited to Cisco gear and their router software, known as IOS. For example, suppose SuproNet was using a Cisco and wanted to prepend their ASN (47868) an additional four times to announcements to a particular provider. They could use something like the following, where the string of x’s refers to the IP address of the provider’s router. Notice that I had to explicitly list 47868 four times.
neighbor xx.xx.xx.xx route-map longerisbetter out
route-map longerisbetter permit 10
 set as-path prepend 47868 47868 47868 47868
This is a common way of prepending in Cisco IOS, so I naturally thought, who would be so dumb as to type (or cut-and-paste or whatever) their own ASN hundreds of times into a configuration for a router? Who? The only problem with this line of reasoning is that SuproNet wasn’t using a Cisco. They were apparently using a router from MikroTik, a router vendor from Latvia, as first reported in this Czech blog. MikroTik obviously targets the Czech market since they have a local language web page and domain.
So how do you prepend on a MikroTik? According to their on-line manual, you set the following variable in an appropriate configuration mode.
bgp-prepend (integer: 0..16) - number which indicates how many times to prepend AS_NAME to AS_PATH
So if SuproNet was thinking “Cisco IOS”, they might have typed “bgp-prepend 47868” to prepend 47868 once. However, this would be a mistake as this router is expecting a count, not an ASN. So at this point, it would be reasonable to expect the MikroTik to report something like “value out of range”. Let’s assume they didn’t do any range checking on the input value and let’s assume they devoted one byte (8 bits) to store this value. One byte can represent all integers from 0 to 255. So what happens when you try to stuff something larger, like 47868, into one byte? You get 47868 modulo 256 (i.e., the remainder after dividing 47868 by 256), which equals 252. As Mikael Abrahamsson first noticed, this was the exact number of prepends of 47868 he was seeing. So I went back and looked at the copious number of announcements we saw of SuproNet’s prefix and guess what? Every single one had 252 prepends of 47868, leading me to conclude that this was the exact number sent out by SuproNet. Originally I was thinking the number of prepends probably varied based on how these long paths were being truncated and that it was this random truncation that was causing part of the problem.
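The arithmetic above can be checked with a few lines of Python. This is only a sketch of the assumption made in this post, namely that the count was stored in a single unsigned byte with no range check. The function name store_in_one_byte is hypothetical and is not MikroTik code.

```python
def store_in_one_byte(value: int) -> int:
    """Keep only the low 8 bits, as an unchecked one-byte field would (assumed behavior)."""
    return value & 0xFF  # identical to value % 256 for non-negative integers


if __name__ == "__main__":
    asn = 47868
    # 47868 = 186 * 256 + 252, so only 252 survives the truncation.
    print(store_in_one_byte(asn))  # 252
```

Any value already inside 0..255 passes through unchanged, which is why a legitimate count such as 4 would never have exposed the missing check.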
And using this clue, Ivan Pepelnjak was able to spell out exactly what happened in his blog. As it turns out, all those routing resets and general instability were due to a previously unknown Cisco bug involving AS paths close to 255 in length. If you try to prepend to a long path that you receive and, by doing so, create a path longer than 255, you are toast. So the maps we gave in our last blog were more of an indication of Cisco market share (at least among prependers), rather than the propensity of outdated routers. Kudos to Ivan for figuring this out.
In summary, we have a situation where a single careless operator in the Czech Republic tickled one bug (i.e., lack of bounds checking in the MikroTik router) that in turn tickled another bug (i.e., a problem with long AS paths on a Cisco). And the result was global Internet instability due to the prevalence of Cisco gear in the market. But in fairness to MikroTik we note that Mathias Sundman observes that bounds checking does now exist in version 3.20 of their router software.
Why did this cascade throughout the planet?
Short answer: There is a bug in Cisco IOS with regard to long AS paths and lots of folks use Cisco gear. Longer answer: Most ISPs apparently do not filter out announcements with long AS paths. As we noted in our previous blog on this topic, we are all fairly close to one another on the Internet and there is really no reason to be seeing excessively long AS paths. Such paths only indicate a problem or a clueless operator or both, and can be safely discarded. The fact that they were not dropped allowed them to tickle this bug on many Cisco boxes along the way.
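To make the "discard long paths" idea concrete, here is a minimal Python sketch of a length-based filter. It is not router configuration and not any vendor's implementation. The cutoff MAX_AS_PATH_LEN and the names is_acceptable and filter_paths are hypothetical, and the short example path uses ASNs from the range reserved for documentation.

```python
from typing import Iterable, List

# Hypothetical policy value chosen for illustration, not a recommendation from this post.
MAX_AS_PATH_LEN = 50


def is_acceptable(as_path: List[int], limit: int = MAX_AS_PATH_LEN) -> bool:
    """Accept an announcement only if its AS path has at most `limit` entries."""
    return len(as_path) <= limit


def filter_paths(paths: Iterable[List[int]], limit: int = MAX_AS_PATH_LEN) -> List[List[int]]:
    """Return only the paths that pass the length check."""
    return [p for p in paths if is_acceptable(p, limit)]


if __name__ == "__main__":
    normal = [64496, 64497, 64500]       # short, made-up path
    runaway = [29113] + [47868] * 252    # shape of the path described in this post
    print(len(filter_paths([normal, runaway])))  # 1
```

With a check like this at the edge, the long announcement would have been dropped at the first network that applied it instead of travelling on toward the 255 boundary.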
Since we are all just a few AS hops away from each other, the problem only occurred because the paths originated from SuproNet were so close to 255. This allowed them to reach the core of the Internet and continue onto other edge networks before exceeding the 255 path length boundary. It was only when they did that all hell broke loose, far away from the original source of the problem. As Andree Toonk provides on this page, there are apparently others who have made the same mistake on MikroTik routers. AS 20912, Panservice of Italy, is doing it as of this writing, but 20912 modulo 256 is only 176 and these announcements are apparently not causing a problem.
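A rough Python sketch shows why 252 was dangerous while 176 was not. It makes simplifying assumptions that are not claims about real router behavior: the path leaves the origin with a length equal to the ASN modulo 256, and each network along the way adds exactly one entry. The helper name hops_until_over_limit is hypothetical.

```python
PATH_LIMIT = 255  # the path length boundary discussed in this post


def hops_until_over_limit(start_len: int, limit: int = PATH_LIMIT) -> int:
    """Entries that must be added before the path becomes longer than `limit`."""
    return max(0, limit - start_len + 1)


if __name__ == "__main__":
    for asn in (47868, 20912):
        start = asn % 256
        print(asn, start, hops_until_over_limit(start))
    # 47868 252 4
    # 20912 176 80
```

Under these assumptions, SuproNet's path crosses the boundary after about four more entries, which is within normal Internet distances, while Panservice's would need roughly eighty.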
Can you provide more details about the impact and its spread?
This was an easy one, as Renesys monitors every prefix (network) seen on the Internet and computes their stability over time. We also geo-locate them as accurately as possible. Thus we can see events like this propagate through the planet and Google Earth provides an excellent way of performing the visualization. We used it to show every newly unstable prefix during the hour before and the hour of the incident. Here are a few composite images taken from Google Earth of a few regions and an indication of all the unstable prefixes seen during the 2 hour period. We start with the US where the impact was the greatest.
Next up is the heart of South America, where Cisco obviously needs to send some sales folks. (Before someone points out the population density of South America relative to the US, we noted in our last blog that South America was the least impacted continent on a percentage basis.)
Finally, we take a look at Europe, where all the trouble started.
How do we prevent this from happening again?
This one is really about assigning blame and there is plenty to go around. But before we get too caught up with that, keep in mind that this was really the perfect storm. As of today, Renesys has observed 31,188 unique non-private ASNs on the Internet over the last few weeks. If you compute modulo 256 of each of them, you get 731 with associated values ≥ 250 or 2.3% of the total. There is nothing special about 250. However, the likelihood of a problem decreases significantly as the values get lower, and 250 seems like a reasonable cutoff, given typical path lengths in the Internet. And there are still only 1,919 ASNs whose modulo 256 value is ≥ 240, or 6.2% of the total. Thus for this event to have occurred at all, besides the bugs in the router software of two vendors, only a few percent of the ASes on the Internet could have possibly initiated the meltdown, but only if they had a careless operator and an obscure Latvian router with outdated software. How likely was that?
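The percentages quoted above can be reproduced from the counts given, and compared with what one would expect if ASNs were spread evenly across the 256 possible remainders. The sketch below uses only the numbers in this post. The even-spread comparison is an added assumption, and the function names are hypothetical.

```python
TOTAL_ASNS = 31188  # unique non-private ASNs observed, per this post


def observed_share(count: int, total: int = TOTAL_ASNS) -> float:
    """Percentage of observed ASNs represented by `count`."""
    return 100.0 * count / total


def uniform_share(cutoff: int) -> float:
    """Percentage of remainders 0..255 that are >= cutoff, assuming an even spread."""
    return 100.0 * (256 - cutoff) / 256


if __name__ == "__main__":
    for cutoff, count in ((250, 731), (240, 1919)):
        print(cutoff, f"{observed_share(count):.2f}", f"{uniform_share(cutoff):.2f}")
```

The observed shares (about 2.3% and 6.2%) sit very close to the even-spread values of 6/256 and 16/256, so nothing about the way ASNs are assigned made this event more or less likely than chance.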
As for the blame, network operators (SuproNet) should obviously read their router documentation and test any proposed changes in a lab environment to see if they get the results they expect. Router vendors should check bounds on input parameters (MikroTik) and on boundary conditions (Cisco). ISPs should filter out obvious useless garbage, like ridiculously long AS paths and unrouteable (private) IP addresses. They obviously don’t, given the scope of the event. And who designed this BGP routing protocol anyway? What were they thinking?
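As an illustration of the input check asked of vendors here, the following Python sketch validates a prepend count against the 0..16 range from the manual excerpt quoted earlier. It is a hypothetical example of the technique, not MikroTik's actual fix, and parse_prepend_count is an invented name.

```python
PREPEND_MIN, PREPEND_MAX = 0, 16  # range taken from the manual excerpt quoted in this post


def parse_prepend_count(text: str) -> int:
    """Parse a prepend count and reject anything outside the documented range."""
    value = int(text)  # raises ValueError on non-numeric input
    if not PREPEND_MIN <= value <= PREPEND_MAX:
        raise ValueError(
            f"value out of range: {value} (expected {PREPEND_MIN}..{PREPEND_MAX})"
        )
    return value


if __name__ == "__main__":
    print(parse_prepend_count("4"))  # 4
    try:
        parse_prepend_count("47868")  # an ASN typed where a count belongs
    except ValueError as err:
        print(err)
```

Rejecting the value outright, rather than silently truncating it, turns the operator's mistake into an immediate error message instead of a 252-entry announcement.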
Seriously, the reason for the success of the Internet is because it is not under the control of any one government or company. Because of this fact, it is both cheap and ubiquitous. But because there is no centralized control or authority, we are largely at the mercy of the weakest link. Sure there is plenty we can do to prevent things like this from happening again, but there will always be the next perfect storm. Who could have guessed something like this could have happened? You won’t be able to guess the next one either. The happy ending to this story is that the community quickly rallied and worked together to both identify and mitigate the problem. No meetings were held, no bailouts were requested and not a single lawyer was needed to draft an agreement. The Internet was back to normal in short order.
One more bit of information: the "unknown Cisco bug" (as-paths longer than 512 bytes) has been known since at least 2005. See page 14 of http://web.dia.uniroma3.it/ricerca/rapporti/schedaRapporto.php?id=102
"If your child leaves a toy on the stairs and you walk by it 100 times without picking it up, will you blame the child when you trip over it on the 101st time? You would be tempted to do so from the hospital bed, but you should not." what kind of crack smoking analogy is that? you are somehow drawing the comparison that a parent picking up a toy that is left in the wrong place is somehow the same as a network engineer not applying security patches when required? furthermore "That's like saying that MS is responsible for the code that a hacker writes that gets executed against a server that an end user has on the Internet with no patches even though a MS PSIRT was released informing everyone in the world about it. That's not how normal life works, man." correct me if im wrong, but bugid CSCsx73770 was only just created, i am running latest code trains (eg 12.2(33)SRC3) and i have affected code. to allude that this is our fault because we have not upgraded our code trains, is utter rubbish. i am trying to think of the last time i did NOT have bugs open with cisco that i have found. whether it is true or not, it seems the quality of the code coming out of cisco has gone down the shitter. i guess thats what happens when you shut down the service provider teams and cater to the consumer market.
Well written and prescient article. Nice investigation and as Randy says "Good lookin out, dog." One thing, though - you should have gotten some bailout money. Everyone's doing it. While the blame can be directed at a few, it is really shared by many. Operators should keep up with bug/defect notifications for the equipment they operate. One of the peanut gallery comments says that the FAULT IS THEIRS! That's like saying that MS is responsible for the code that a hacker writes that gets executed against a server that an end user has on the Internet with no patches even though a MS PSIRT was released informing everyone in the world about it. That's not how normal life works, man. If your child leaves a toy on the stairs and you walk by it 100 times without picking it up, will you blame the child when you trip over it on the 101st time? You would be tempted to do so from the hospital bed, but you should not.
Considering how much the internet depends on Cisco, they do a pretty bang up job. Even the space shuttle's software team sees 5 defects per 10k lines of code..
For some reason you seem overprotective of Cisco. The fault IS THEIRS! And they should be crammed down for this. No matter how incompetent the other parties were. Editor's note: We have no business relationship with Cisco or any of the other parties identified here.
Could this problem still be affecting us in the Southern hemisphere today [23rd Feb]? Editor's Note: As of this writing, the world is amazingly stable from a BGP perspective.
You write: "And who designed this BGP routing protocol anyway? What were they thinking?" BGP was designed by Yakov Rekhter and Kirk Lougheed, with input from a cast of hundreds (the IETF's IDR WG). They were thinking that it would be good if packets could get from point A to point B in a reasonably sensible way. That pretty much necessitates a routing protocol that propagates information from one side of the Internet all the way to the other. When there are flaws in the implementation of that protocol (as opposed to the design) and insufficient genetic diversity in the implementations in the field, this is the result. Tony Li IOS BGP developer 1991-1993 p.s. Oops. Sorry. ;-)
How a Router’s Missed Range Check Nearly Crashed the Internet “A bug by router vendor A (omitting a range check from a critical field in the configuration interface) tickled a bug from router vendor B (dropping BGP sessions when processing some ASPATH attributes with length very close to 256), causing a rip...
Seems to me that there is too much trust, or perhaps worse, "blind faith" in neighbour-to-neighbour BGP advertisements and updates. I'm not sure exactly what the diameter of the internet is in network-border-network-border terms but I'm guessing that as most places can be reached in under 30 IP hops and there are two or more hops per ISP that the number of ASes traversed is less than half that, say 10-15 at tops...? So, shouldn't we filter incoming BGP updates with large as-paths and drop them on the floor? i.e. we should have some form of access control list-like rules on BGP updates... and logging (syslog() or SNMP traps?) to catch this sort of problem - then all that would have happened would have been a call from the neighbour ISP's network operations centre to say "hey guys we're getting duff BGP updates from you and are dropping them" rather than the internet catching a cold? Sounds like routers need to gain some BGP filtering commands and ISPs need a code of practice to me! Mike
In about 1994, I remember playing around with some code called "crash.c", which took a list of syscalls and executed them with random garbage in the arguments line. It was generally considered really cool that this could be run on the linux-du-jour for days at a time without causing anything except core dumps, but that it would cause both windows (3.11, as you ask) and NT to die horribly within seconds. Such is life. It would be interesting to get a similar style BGP crap generator and hook it into a bgp cloud, to see which vendors die and how they die.
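The network is very fragile. The Internet rose because everyone was free to mess around with it, but in the end what wrecks the Internet is also that everyone is messing around with it. Today I saw an interesting news item about a major network blowup that happened a few days ago; the story is long, so anyone interested should go straight to...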
Thank you for the analysis of the details of the incident. One problem I see is that there is simply too much Cisco in the world... As we have seen on the client side, a computing monoculture is a dangerous thing. We have many large botnets that could do considerable harm to various portions of the Internet. This is usually due to the dominance of Windows on the desktop or the prevalence of the BSD networking stack used by most operating systems. I see the same happening with Cisco equipment. They make phenomenal gear, but the prevalence of this equipment led to a magnification of this "perfect storm." I have to commend the network engineers who responded to this situation for addressing it before it became even worse.
And that last bit is the entire point of the internet. For all its crufty and often stupid quirks (127/8 just for localhost? I could put localhost, no-address-assigned, link-local, all-local-hosts, and a host of private ranges in a single /8, thank you; directed broadcast that turned out to be a bad idea but still eats two addresses out of every subnet; I'm sure you can come up with more) it does work pretty amazingly well. Now see various governments try and do a landgrab in various ways. Logging, tapping, retention, grabbing the dnssec root keys, and so on and so forth. If only we could afford to care for the technical side only.
Brilliant article! I have no idea what Renesys is, but I'm going to have a poke around. Dug the writing, thanks guy(s)!