As promised last week, here's part two of the story about a rough week at Cogent. When last we left our intrepid optical network, it was depeering wee little British autonomous systems in an effort to gussy itself up for future suitors (or so we guessed, although there were several other interesting guesses as well; more on that shortly). Well, things went downhill from there.
On Wednesday, April 25, at about 19:25 UTC (15:25 EDT / 12:25 PDT), Cogent had a fairly serious backbone issue, which was reported on NANOG. It was a moderately large event, with most of Cogent's network impacted for about 45 minutes and at least some part of the network affected for almost three hours. The problem was attributed to a router software bug. Cogent had another problem later in the week, on Friday, that appears to have impacted only customers in Boston.
Part of my interest in these events is personal: Renesys (AS34135) is single-homed to Cogent at a development site in Boston. Both outages happened to hit in the middle of user testing for a new application we're working on (more on that in the coming weeks). So that was pretty embarrassing and frustrating. We're shopping around for other providers at 1 Summer now, but (as usual) providers are unclear on whether they can offer service in the building and what they might charge to do so. So we're waiting. Additionally, two of Renesys's three other service providers in New Hampshire, Worldpath (AS3770) and SEGNet (AS11524), use Cogent as one of their upstreams as well. So we were impacted by the problems there too. But being a customer of, or a provider to, someone who has a network problem isn't enough to raise my interest (we have a lot of customers who run networks, strangely enough).
My main interest in the Cogent outage is that it was large enough to be felt across the Internet, and it gives me an opportunity to look at some of the ways to understand and analyze such events after the fact. So let's take a look at what happened, not just from the RFO (Reason For Outage) issued by Cogent, but from what the whole Internet thought of the event.
