US-EAST-4 Outage and Data Centre Fire
Posted: 14th Jul 2023
This post refers to the incident which we tracked on our status page here: https://status.servd.host/notices/ihbni93vusslj4q4-us-east-4-data-centre-issue
On the evening of the 10th July a fire broke out in one of the data centres that Servd uses to host projects. The fire resulted in a complete loss of electricity feed to the servers.
We managed to migrate the majority of impacted projects over to a cluster in another data centre from backups.
The data centre came back online in the early hours of the morning on 13th July, resulting in a total of 54 hours of service interruption for any projects that were not migrated.
Although it is difficult to anticipate such rare events, we're putting together some plans to help with any similar occurrences in the future.
Who was involved?
- Servd (us!) who provide the server and project management layer for your Craft sites.
- Civo - our server provider for the impacted cluster. They put the physical servers in the data centre racks and rent them to us to run our workloads on whilst also providing us with additional 'cloud' services to make that easier.
- Evocative/INAP - the owners/operators of the data centre. They own the building, run the site and are responsible for ensuring physical security and access to essential utilities for the server racks it contains.
Timeline
All times are UTC.
10th July
- 21:50 - Disruption begins and is picked up by our monitoring.
- 21:55 - We reach out to Civo to check for any active incidents.
- 21:58 - We create our initial notification within our status tracker.
- 22:10 - Civo confirm that there's an issue in the data centre and that they are investigating.
- 22:46 - Civo get confirmation from the data centre that there has been an interruption to the power supply to the building and relay that to us.
11th July
- 00:18 - With no ETA on recovery, we begin offering customers the option to migrate projects to an alternative cluster and restore from backup.
- 02:00 - The data centre confirm that their UPS systems have had to be shut down and that they're working on the problem but with no firm ETA on recovery.
- 02:13 - Cause is confirmed as a fire and subsequent safety procedures.
- 03:35 - Work begins in the data centre to replace damaged electrical and fire suppression systems; we're told to expect an update at 12:00 UTC.
- We continue to migrate projects for clients who reach out to confirm it's ok.
- 13:30 - We're informed that a further 3 hours is required for work to be completed to make the site safe.
- 18:00 - An inspection is carried out by the fire marshals to determine safety before re-engaging the data centre's main electricity connection. The result of this inspection is a request to replace specific portions of the fire suppression system, with a re-inspection taking place on July 12th at 13:00 UTC.
- 19:39 - We send out an email to all members of impacted projects urging them to get in touch to begin migrating their projects to another cluster in a different data centre.
- We continue to migrate projects upon request.
12th July
- 15:55 - The re-inspection is completed and the site is signed off as safe. Power is restored and critical building systems are brought online.
- 18:03 - Main critical components are online and checked - an environmental reset is started to clean the air and reduce the temperature of the building before server racks are switched back on.
- We continue to migrate projects upon request.
13th July
- 01:13 - Civo's server racks are powered on and they begin work on reinitialising their systems.
- 05:31 - Our platform becomes accessible from the internet and we work to fix any software issues caused by the incident.
- 06:42 - All projects are confirmed as running stably and we issue a command to trigger a backup of data for all projects which have been migrated to another cluster.
- 08:17 - We send out an email to all members of impacted projects notifying them that the incident has been resolved. The email contains steps to restore database data for projects that migrated to another cluster.
What happened in the data centre?
The root cause of the outage was a fire which broke out in one of the electrical rooms within the data centre (DC). The fire originated from one of three UPSes (uninterruptible power supplies) that act as backup electricity feeds for servers in the event of a power outage.
The fire suppression system in the data centre extinguished the fire automatically and the fire brigade was automatically called. We saw an initial degradation in availability as the fire caused a fluctuation in the electricity feed, resulting in some hardware within the DC restarting.
Once the fire brigade arrived at the site, they ordered the main electricity feed to the building to be disabled as is their procedure during a significant electrical fire.
Upon learning that the fire broke out in a UPS system, the remaining UPS systems were also ordered to be disabled while a safety inspection was carried out. This led to a complete loss of electricity feed to all servers within the data centre and a corresponding loss of availability.
Once the building had been declared immediately safe by the fire marshal, a full safety audit and replacement of fire suppression systems within the impacted area was started. This work took place over the evening of 11th July, with the fire marshal's final safety inspection occurring at 13:00 UTC on 12th July.
Once the safety audit had been completed the DC was able to bring its electricity feeds back online and begin preparations for booting up servers. This involved a full environmental reset - venting any impacted areas to clear residual smoke / fire suppressant etc and cool the building back down to optimal temperature.
Once servers started coming back online in the evening of the 12th, we worked with Civo to ensure that our workloads came back online in an expected state.
How were projects migrated?
Project migrations between clusters on Servd currently require two things:
- Restoration of the project components from a database backup
- Repointing any associated DNS records for the project's domains over to the new cluster
Technically, the restoration from backup takes us between 2 and 10 minutes, depending on the size of the project's database, but we need customer confirmation before performing this task for two reasons:
- Restoring from a backup discards any database changes made since the backup was taken, so we need to make sure that's ok. Even when we know we'll be able to recover a full copy of the project's most recent data later (as was the case with this outage), restoring from backup results in two 'branches' of diverging data that can't easily be merged, so we need customer confirmation to determine the best course of action for each project.
- We need to make sure the customer is able to update DNS records. This can be a challenge for many of our customers and we therefore can't assume they will be able to do so within any specific timeframe.
Once we had confirmation from a customer, the migrations were enacted as soon as possible. In this instance most were completed within 10 minutes, however there were two periods in which this grew to around 30 minutes as demand increased.
Note: a couple of project migrations were delayed longer than average because their requests were sent by email and filtered as spam in our support system. We've since rectified that by disabling all spam filtering.
After the project migration, customers were able to update their DNS records allowing projects to come back online in their new home.
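As a rough illustration of why we want to automate this, the two migration steps reduce to "restore from backup onto the new cluster, then tell the customer which DNS records to change". The sketch below is hypothetical: none of these helper names, clusters or endpoints are Servd's real internals.

```python
# Hypothetical sketch of a cluster migration: restore from backup,
# then produce the DNS changes the customer needs to make.
from dataclasses import dataclass

@dataclass
class Project:
    name: str
    cluster: str
    domains: list

def latest_backup(project: Project) -> str:
    """Stand-in for fetching the most recent backup identifier."""
    return f"{project.name}-backup-latest"

def restore_to_cluster(project: Project, target_cluster: str) -> Project:
    """Restore the project's components onto the target cluster from backup."""
    backup_id = latest_backup(project)
    # In reality this would drive the provisioning/restore pipeline.
    print(f"Restoring {backup_id} for {project.name} onto {target_cluster}")
    return Project(project.name, target_cluster, project.domains)

def dns_instructions(project: Project, new_endpoint: str) -> list:
    """Produce the DNS changes the customer needs to make."""
    return [f"Point {d} at {new_endpoint}" for d in project.domains]

site = Project("example-site", "us-east-4", ["example.com", "www.example.com"])
migrated = restore_to_cluster(site, "us-west-2")
for step in dns_instructions(migrated, "us-west-2.servd.example"):
    print(step)
```

The restore half can be fully automated; the DNS half is the part that still depends on the customer, which is why the CDN work discussed later is attractive.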
Was any data lost?
No data was lost during this incident. The fire did not directly impact the servers we use, and when systems came back online all data was present in the state it was in immediately before the power was switched off.
In order to allow any migrated projects to update from their backup data back to their real-time pre-incident data, we executed a database backup in the restored us-east-4 cluster for all migrated projects. This created a 'Manual Backup' entry in our dashboard containing the pre-incident project data. We then left it to our customers to decide whether to keep their current data or restore the pre-incident data. The restoration process was a single click of 'Restore Backup' within our dashboard.
What is being done to help with similar occurrences in the future?
Overall we were happy with the way our disaster recovery process performed, but there are some things we'll be adjusting as a result of this event. It's worth stressing that the root causes, a data centre fire and subsequent safety protocols put in place by the fire department, are extremely rare and impacted hundreds of the DC's clients (including Facebook who had services colocated in the same building).
We were also pleased that our strategy of splitting our platform into multiple clusters, hosted in several different data centres, resulted in only a subset of Servd projects being impacted, and also allowed us to migrate projects between pre-prepared clusters immediately.
However, some small things that we are going to improve:
- Change the default backup regularity for Starter projects to daily instead of weekly.
- Create internal tools for project cluster migrations instead of actioning them manually. They only take a couple of minutes each, but when there's a lot of them it's better to automate it all!
- Disable any spam filtering on our support email so that legitimate emails aren't missed.
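The backup-cadence change above amounts to a per-plan default mapping. As a hypothetical sketch (the cron expressions and plan names here are illustrative, not Servd's actual configuration):

```python
# Hypothetical per-plan default backup schedules as cron expressions.
# Starter previously defaulted to weekly ("0 3 * * 0"); after the
# change it defaults to daily, matching the other tiers.
DEFAULT_BACKUP_CRON = {
    "starter": "0 3 * * *",     # daily at 03:00 UTC (was weekly)
    "pro": "0 3 * * *",         # daily
    "enterprise": "0 3 * * *",  # daily
}

def backup_schedule(plan: str) -> str:
    """Look up a plan's default backup cron expression."""
    return DEFAULT_BACKUP_CRON[plan.lower()]

print(backup_schedule("starter"))
```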
And some larger changes that we'll be investigating:
- Geo-redundant, real time data backups as an optional plan addon. This would keep an up-to-date copy of a project's database data in a secondary data centre, which can then be used as a data source for project restoration in the event of a full DC blackout.
- Adding a CDN layer in front of all Servd services. This is something we've been investigating for a while, but have struggled to find a good solution whilst still allowing apex domains to be pointed at us. If we did have a CDN in place, we'd be able to reroute incoming project traffic between data centres without needing a DNS update, bringing us much closer to the ability to fail over projects from backup within minutes and without customer involvement. We're currently speaking to Cloudflare who we believe would be able to offer a solution that would work for us.
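To sketch why the CDN idea matters: if the edge resolves each project to an origin cluster from its own routing table, failing a project over becomes a single table update at the edge rather than a DNS change every customer has to make themselves. A minimal illustration, with all names hypothetical:

```python
# Hypothetical edge routing table mapping projects to origin clusters.
# With a CDN in front, failover is one update here; the customer's
# DNS records keep pointing at the CDN and never change.
ORIGINS = {
    "us-east-4": "origin-us-east-4.servd.example",
    "us-west-2": "origin-us-west-2.servd.example",
}

routing_table = {"example-site": "us-east-4"}

def origin_for(project: str) -> str:
    """Resolve the origin host the edge should proxy this project to."""
    return ORIGINS[routing_table[project]]

def fail_over(project: str, target_cluster: str) -> None:
    """Repoint a project at another cluster; DNS stays untouched."""
    routing_table[project] = target_cluster

fail_over("example-site", "us-west-2")
print(origin_for("example-site"))
```

Combined with geo-redundant backups, this would let us restore a project onto a healthy cluster and reroute its traffic within minutes, without customer involvement.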
What other lessons have been learned?
- Not everyone subscribes to our status notification channels. Next time we'll be sure to send out emails to everyone we think might be impacted by such a serious incident, rather than only those who have subscribed to our status updates.
- Events like this aren't easy for our customers, our team or our upstream providers. All levels within that hierarchy have been negatively impacted and we see it as our responsibility to try to shield our customers from that as much as possible.
- Our customers are very nice, thanks for all the words of support!
What compensation are you providing?
Per our standard SLA, many projects within the us-east-4 cluster dropped below our expected 99.9% uptime this month. If you'd like a full refund of the month's costs, please drop us an email at [email protected] mentioning the project name(s) and we'll be happy to oblige!