What follows is a relatively detailed, technical description of the causes of the outage. Because not everyone wants or needs this much detail, we start with a shorter, simpler summary of the problem:
There was a problem with the devices (switches) that regulate the redundant network connections to our data centre. While we were trying to solve this isolated problem, a feedback loop unfortunately occurred that clogged all of these connections.
We have multiple connections to the data centre so that we do not lose access to our machines. However, this issue blocked all of those lines, so we had to travel to the data centre in person to restore the connection.
When the connection was restored we noticed that the servers inside the data centre had not been able to communicate with their storage servers for the duration of this problem. Because of this they had to be restarted to restore their connection.
As the result of a malfunction originating in one of our switches, the Greenhost network in Amsterdam (Haarlem) was unreachable between roughly 15:15 and 16:30 CET. After network connectivity was restored, it became apparent that access to the storage systems could not be re-established for all systems. The only way to restore it was to reboot all virtual machines in the network, which impacted all Greenhost services.
It took until 18:00 to restore all major services, and until 18:30 before all virtual machines were rebooted and available. Individual (cloud) VPS clients might have faced additional issues related to their machines rebooting.
There was no data loss, and all delayed email was delivered afterwards.
Our network has a fully redundant design to make it fault tolerant. However, one of the mechanisms in this design, the Spanning Tree Protocol (STP), was the root cause of the failure.
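For readers unfamiliar with STP: the switches elect one "root" switch, and the switch with the lowest bridge ID (a configurable priority, with the MAC address as tie-breaker) wins. The following is a minimal Python sketch of that comparison only. The switch names, MAC addresses and priority values are invented for illustration and do not reflect our actual configuration, and real switches exchange protocol messages rather than calling a function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bridge:
    name: str      # hypothetical switch name
    priority: int  # configurable; lower is preferred (32768 is the usual default)
    mac: str       # hypothetical MAC address, used as tie-breaker

    def bridge_id(self):
        # Compare the priority first, then the MAC address as a number.
        return (self.priority, int(self.mac.replace(":", ""), 16))


def elect_root(bridges):
    """Return the bridge with the lowest bridge ID."""
    return min(bridges, key=Bridge.bridge_id)


switches = [
    Bridge("core-1", 32768, "00:1a:2b:00:00:01"),
    Bridge("core-2", 32768, "00:1a:2b:00:00:02"),
    Bridge("access-1", 32768, "00:1a:2b:00:00:0a"),
]
root_before = elect_root(switches)

# Lowering the priority of one switch makes it win the next election.
switches[1] = Bridge("core-2", 4096, "00:1a:2b:00:00:02")
root_after = elect_root(switches)

print(root_before.name, "->", root_after.name)
```

With equal priorities the lowest MAC address decides, which is why changing a priority is one common way to influence which switch becomes root.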
10:30 - Our engineers confirmed a problem in the traffic propagation of one of our customer VLANs through the network infrastructure.
14:00 - After several hours spent tracing the source of the problem, we found that one of our core switches had experienced an STP re-election earlier that morning and had been discarding the traffic of the VLAN in question ever since.
14:30 - After consulting the documentation, we discussed possible courses of action to remedy the issue.
15:00 - After agreeing on a possible solution, we attempted to trigger a new STP root switch re-election, which in theory would have restored proper propagation of traffic for the VLAN in question.
15:15 - While attempting to trigger the STP root re-election, a network loop was inadvertently created over one of the access switches. This caused network congestion that left most of our infrastructure unable to communicate with the rest of the platform or the outside world. Unfortunately the out-of-band access failed, so reverting the configuration was not immediately possible.
15:25 - A team of our engineers was dispatched to our data centre to fix the problem on-site, while our front office tweeted about the outage and prepared a voice message for clients calling about the issue. Meanwhile, the engineers at headquarters tried to find a way to restore communications remotely.
16:00 - The engineers arrived at the data centre, but entry was delayed because it was too crowded at the data centre at that time.
16:10 - Once on site, the engineers reverted the problematic switch to its original configuration, which restored the network. The team at headquarters immediately started verifying the state of the platform.
16:15 - At this point it was apparent that some of our infrastructure was not able to recover from the network outage automatically. Modern Web, email and VPS hosting all rely on underlying storage that depends heavily on the network to function, which makes them more sensitive to network disruptions, especially if timeouts occur.
16:25 - After a quick planning session, engineers set out to restart and verify the operation of the infrastructure elements that were showing erratic behaviour after the outage. This affected a large part of our services, including email, Web and most of our VPS hosting customers.
17:20 - Most of our hosting services were back online. An outage notice had been put on our website and was updated as the situation changed.
18:30 - Everything was confirmed to be back online and functioning as expected, including the problematic VLAN that sparked this issue initially. The final update to our website and Twitter account was published.
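To illustrate why the loop created at 15:15 was so disruptive: Ethernet frames carry no hop limit, so a broadcast frame that enters a loop keeps being forwarded. The Python sketch below is a simplified model, not a reconstruction of our network. The three-switch topology and the switch names are hypothetical, and each switch simply forwards a broadcast to every neighbour except the one it came from.

```python
def flood(links, start, steps):
    """Return the number of broadcast copies in flight after each step."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)

    # Each in-flight copy is (switch it is at, switch it came from).
    in_flight = [(start, None)]
    history = []
    for _ in range(steps):
        next_round = []
        for here, came_from in in_flight:
            for peer in neighbours.get(here, []):
                # Simplification: assumes at most one link per switch pair.
                if peer != came_from:
                    next_round.append((peer, here))
        in_flight = next_round
        history.append(len(in_flight))
    return history


# Hypothetical topology: three switches connected in a triangle.
looped = [("core-1", "core-2"), ("core-2", "access-1"), ("access-1", "core-1")]
# Same topology with one link blocked, as STP would normally arrange.
blocked = [("core-1", "core-2"), ("core-2", "access-1")]

storm = flood(looped, "core-1", 6)
calm = flood(blocked, "core-1", 6)
print("with loop:   ", storm)
print("link blocked:", calm)
```

In the looped case the copies of a single broadcast never drain away, and every further broadcast adds more, which is how a loop can saturate links. With one link blocked the frame reaches every switch once and then stops.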
Based on the events of yesterday, several improvements are scheduled for our operational procedures and network infrastructure design.
- The out-of-band equipment failed. This made it necessary to send staff to the data centre, delaying recovery. The design of our out-of-band facility will be adjusted.
- Although there is a strict policy in place for change management, this policy will be improved.
- We are in contact with the data centre about how to eliminate delays at entry. Although the delay was short, we think it should not have happened in the first place.