Incident summary
During routine maintenance, moving a network cable exposed a dormant issue in a networking component, which degraded one connection. Later, while we were trying to fix this issue, an unrelated network cable was accidentally disconnected for a few seconds. Because our infrastructure is built redundantly, neither incident should have had any impact. However, a software malfunction prevented another connection from taking over on both systems. This put them into an unstable state that resulted in partial unavailability of our services between 15:20 and 19:45 CEST.
Resolution and recovery
The affected systems were taken out of our production pool, and customer VPSs and services were relocated to other hardware. Some customer VPSs had to be restarted because a smooth migration from one of the systems was impossible or would have been very slow.
Affected services/impact
- E-mail (partial):
  - Impact for customers: 20% of customers couldn't read e-mail (sending still worked)
  - Impact time: 15:20 – 16:51
- Hosting (partial):
  - Affected customers: 45% of hosting customers
  - Impact for customers: unavailable websites, scheduled tasks not running, inability to access SFTP / shell
  - Impact time: 15:20 – 17:08
- VPS (partial):
  - Affected VPSs: 67 (on two hardware machines)
  - Impact: VPS unavailable until rebooted
  - Impact time: 15:20 – 17:22
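To put the impact windows above into perspective, the durations can be worked out with a few lines of Python. This is only an illustrative sketch: the start and end times are copied from this report, while the names `IMPACT_WINDOWS`, `duration_minutes` and `format_duration` are made up for this example and are not part of any tooling we use.

```python
from datetime import datetime

# Impact windows as stated in this report (times in CEST, same day).
# "Overall" is the window given in the incident summary.
IMPACT_WINDOWS = {
    "E-mail": ("15:20", "16:51"),
    "Hosting": ("15:20", "17:08"),
    "VPS": ("15:20", "17:22"),
    "Overall": ("15:20", "19:45"),
}


def duration_minutes(start: str, end: str) -> int:
    """Return the number of minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def format_duration(minutes: int) -> str:
    """Format a number of minutes as 'Xh YYm'."""
    hours, mins = divmod(minutes, 60)
    return f"{hours}h {mins:02d}m"


if __name__ == "__main__":
    for service, (start, end) in IMPACT_WINDOWS.items():
        minutes = duration_minutes(start, end)
        print(f"{service}: {start} to {end} = {format_duration(minutes)}")
```

With the times listed above, this gives roughly 1h 31m for e-mail, 1h 48m for hosting, 2h 02m for VPS and 4h 25m for the incident as a whole.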
Corrective/Preventative actions
- We are evaluating the software procedure that controls failover in an incident like this. It will be replaced by another mechanism that we already use successfully in other parts of our infrastructure.
- Due to holidays and Covid-19 regulations at the datacenter, our response, while adequate, took longer than we would have liked. In future, we will make sure more engineers are available, even for routine maintenance like the work that was scheduled last week.
- Despite our efforts to distribute our services across different hardware, the way we do this can be improved to further lower the impact of a single system malfunction.
Timeline (times in CEST)
- 15:20 A hardware issue in a networking component manifests following some cable movement.
- 15:21 Services on the affected system gradually start to misbehave because they are starved of network traffic.
- 15:45 While components are being replaced to fix the issue on the first system, a cable on a second system is accidentally disconnected for a few seconds.
- 15:46 Services on the second system also start to misbehave.
- 15:55 It becomes clear that the systems are not responding as expected; the decision is taken to relocate the services running on them.
- 16:48 About half of the affected websites come back online (45% of hosting customers were affected).
- 16:51 E-mail services are again fully functional.
- 17:08 The remaining websites are back online.
- 17:22 Half of the affected customer systems are relocated or restarted.
- 19:20 All affected customer VPSs are restarted or relocated.
- 19:45 Everything is confirmed to be back online.
About this report
Greenhost believes it is important to be transparent about our operations; this includes downtime and security incidents. While we build our infrastructure and processes to be resilient in case of an incident, outages can, and will, happen to any IT organisation. We believe it is better to show what happened and how we learn from it and improve our infrastructure than to leave our users in the dark.