ACTION PLAN TO AVOID FURTHER OUTAGES

Paris, October 18th, 2013

These past few months, France-IX has undergone a series of outages which highlighted the limits and weaknesses of the current network design.

France-IX wants to assure its members that the whole team, with the help of experts from the internet community, is taking all the necessary steps to avoid going through such incidents again.

The present action plan aims to give details about the events, the technical reasons behind them and the list of actions that will be carried out as soon as possible to return to a resilient and available infrastructure.

1/ THE CURRENT DESIGN

FranceIX relies on a heterogeneous infrastructure including:

- MPLS in the core sites (Telehouse-2 and Interxion-5) and in the Interxion-2 and SFR Netcenter (Marseille) sites.
- Spanning-tree Ethernet on all the edge sites (Interxion-1, TelecityGroup Courbevoie, TelecityGroup Condorcet and Iliad-Vitry).

Our infrastructure backbone is based on MPLS logical static links (LSPs relying on RSVP), which ensure the best traffic and capacity engineering on the platform. We are using Brocade MLX 8/16/16e equipment with code version 5.2h because this version is considered stable.

The other edge sites are pure Layer 2 and host other equipment. We have old Force10 E1200 at the Iliad-Vitry, Interxion-1 and TelecityGroup Courbevoie sites. This equipment runs the code FTOS-ED-6.5.4.3, which is the latest stable code version available for this hardware.
In TelecityGroup Condorcet, a Force10 S4810 is installed with code version FTOS-SE-8.3.12.1.

The edge Layer 2 sites are connected directly to the VPLS fabric through 2 separate paths to ensure redundancy.

Figure: FranceIX's current network topology

2/ ENCOUNTERED ISSUES

Over the past few months we have been facing some major outages on the platform and we identified several recurring issues.

Port flapping:

On several occasions, we had some ports flapping between the MPLS routers, a phenomenon that can happen in any live network. On FranceIX's infrastructure, a defective link should not be a problem since the network has been overprovisioned.

Our MPLS links (LSPs) are configured with Fast Reroute and BFD. The internally used routing protocol (IGP), IS-IS, is also running with BFD. These practices were recommended by the vendor in order to re-route the traffic effectively in the event of a link outage.
With the outbreak of flaps, it became obvious that when we lost a link in a LAG, we faced instability, specifically in the VPLS core. Following the recommendation of several experts, we decided to remove BFD on MPLS and also on IS-IS.

Since we made this configuration change, we have had multiple flaps (due to link failures from our supplier and to port relocations) but these had no impact on the overall platform, which is the expected behaviour.

Spanning-tree loops:

Members noticed multiple unusual broadcast storms on their ports facing the FranceIX infrastructure. On the Layer 2 infrastructure, we have 2 types of equipment:

- Force10 E1200 (Etherscale)
- Force10 S4810

The Force10 E1200 are old equipment on which we can't activate some Layer 2 protections such as storm-control limitation. The only available ACL protection consists of filtering MAC addresses. These past few weeks, we worked on enabling MAC address limitation all over the infrastructure: only one explicit MAC address per member is now authorized and configured.

On the Force10 S4810, we can do both, and we do both: MAC address filtering and broadcast storm limitation (up to 1% of the port).

We also fixed the configuration on some ports where spanning-tree was still activated. We don't have any ports left with spanning-tree enabled on them.

However, some members made a loop on their ports facing the exchange. On Brocade equipment, these loops had no effect on the rest of the infrastructure, whereas on Force10 E1200 equipment we were not able to contain these loops to the ports of the members concerned.

These latter loops caused a lot of instability on the whole platform for 2 reasons:

- We were not able to automatically shut down the ports concerned.
- As our VPLS core does not run spanning-tree, the core was unable to detect the loop emerging from the edge sites as abnormal traffic.

Here we are facing a design incompatibility between Layer 2 and VPLS. The redundancy we wanted to implement on the edge sites with Layer 2 was in the end causing more trouble than it solved.
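The two edge protections described above (a single authorized MAC address per member, and a broadcast budget of 1% of the port speed) can be illustrated with a minimal Python sketch. It is only a model of the filtering logic, not Force10 or Brocade configuration; the class name `MemberPort`, the function `accept_frame` and the example MAC addresses are hypothetical.

```python
from dataclasses import dataclass

BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"


@dataclass
class MemberPort:
    """Hypothetical model of one member-facing port (not vendor configuration)."""
    allowed_mac: str            # the single MAC address authorized for this member
    speed_bps: int              # port speed in bits per second
    broadcast_percent: int = 1  # broadcast budget as a percentage of the port speed

    def broadcast_budget_bps(self) -> int:
        # 1% of a 10 Gbit/s port is 100 Mbit/s of broadcast traffic
        return self.speed_bps * self.broadcast_percent // 100


def accept_frame(port: MemberPort, src_mac: str, dst_mac: str,
                 frame_bits: int, broadcast_bits_this_second: int) -> bool:
    """Return True if the frame passes both filters."""
    # MAC filtering: only the explicitly configured source address is allowed
    if src_mac.lower() != port.allowed_mac.lower():
        return False
    # Storm control: drop broadcast frames once the per-second budget is used up
    if dst_mac.lower() == BROADCAST_MAC:
        if broadcast_bits_this_second + frame_bits > port.broadcast_budget_bps():
            return False
    return True


port = MemberPort(allowed_mac="00:11:22:33:44:55", speed_bps=10_000_000_000)
```

In this model a looped member port is contained in two ways: frames re-entering with a foreign source MAC address are dropped, and broadcast traffic beyond the budget is discarded instead of being flooded to the rest of the platform.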
Port flapping AND spanning-tree loops (last incident on October 16th)

The root cause of the last incident was a combination of port flapping and a spanning-tree loop.

We had a flapping card in our chassis in Interxion-2. This card was part of a 6*10G LAG between Interxion-2 and Interxion-5. We discovered and understood a bug in the Brocade software: if the first link of a LAG is lost, the whole LAG goes down (this behaviour is specific to Brocade). With Brocade, the reload of a card is really fast, but this made things worse because the VPLS topology was also flapping.

As we don't run spanning-tree in the core, an infinite loop was created between the Layer 2 equipment and the fabric (because the Layer 2 equipment has 2 links to 2 different VPLS nodes). We have moved the Layer 2 equipment to a single attachment to the VPLS nodes.

The incident was amplified by the physical loop we had between 2 ports on the VPLS fabric for our monitoring system (sFlow export). That is why we disabled the statistics for a day while changing the stats topology on that same day.

3/ ACTION PLAN

The network is now stable thanks to several fixes we already applied on the infrastructure, and we encourage the members who deactivated their sessions to bring them up again.

In addition to that, we now plan a list of actions:

- to ensure stability and continuity in the services delivered to the members,
- to have a homogeneous network,
- and to be able to include high-speed connections (100G).

All required actions will be taken to reach these goals.

For the time being, we will disable the redundant links between the Layer 2 infrastructure and the VPLS fabric until we replace the Layer 2 equipment. In the meantime, a redundant link will be activated only if an issue occurs on the primary link. This point does not concern the sites interconnected in MPLS.
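The effect of falling back to a single attachment can be illustrated with a small Python sketch that checks a list of Layer 2 links for a cycle using union-find. This is a deliberate simplification: it treats the VPLS fabric as a flat set of bridged links and assumes no loop-prevention protocol is running. The node names and the function `has_loop` are hypothetical.

```python
def has_loop(links):
    """Return True if the undirected link list contains a cycle (union-find)."""
    parent = {}

    def find(node):
        # Register unseen nodes as their own root, then walk up to the root
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            # Both ends are already connected: this link closes a loop
            return True
        parent[root_a] = root_b
    return False


# Edge switch attached to 2 different VPLS nodes that are themselves connected
dual_attached = [
    ("edge-l2", "vpls-node-1"),
    ("edge-l2", "vpls-node-2"),
    ("vpls-node-1", "vpls-node-2"),
]

# Same topology with the redundant edge link disabled
single_attached = [
    ("edge-l2", "vpls-node-1"),
    ("vpls-node-1", "vpls-node-2"),
]
```

Under these assumptions, `has_loop(dual_attached)` is true and `has_loop(single_attached)` is false, which is why disabling the redundant link removes the path along which frames could circulate indefinitely.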
In addition to that, over the next weeks, we will:

- Upgrade the code of our Brocade MLX to 5.4.ca,
- Upgrade the Force10 S4810 to FTOS-SE-9.2.1.0,
- Remove the Force10 E1200 and replace them with other equipment,
- Migrate the old connections of members connected to PaNAP switches onto FranceIX's own equipment (only the TelecityGroup Courbevoie site is concerned).

The configuration of the infrastructure will also be improved, and a new network topology, which is currently being studied, will be announced to the members and rolled out in due time.
What we learned:

- Don't rely only on your equipment supplier for the configuration they advise,
- Trust the community and get help from it,
- Industrialize the processes,
- Apply drastic filtering rules to all the members (no exception!),
- Layer 2 on a multi-site IXP is bad!

FranceIX will keep its members informed about the progress of this action plan.

FranceIX thanks its members for their help and their continued trust through all the recent events.