Recent Downtime Review

Posted: 5th Apr 2021

tl;dr

On the 30th, 31st March and 2nd April, Servd suffered some periods of downtime on its eu-west-1 cluster. This was caused by a large DDoS attack against our EU infrastructure in an attempt to ransom one of our clients.

Timeline

  • 15:03 UTC 30th March: the inbound traffic to our eu-west-1 cluster dropped to 0req/s.
  • 15:06 UTC 30th March: we were paged about the issue and began investigating.
  • 15:14 UTC 30th March: inbound traffic resumes at normal levels.
  • 15:40 UTC 30th March: the problem is isolated to the cluster's primary load balancer. Ticket raised with Digital Ocean (our server provider).
  • 17:17 UTC 30th March: after a brief chat with DO they conclude that there's no problem with the LB and that the issue must have been with the datacentre network. Support ticket is closed.
  • 15:00 UTC 31st March: the inbound traffic to our eu-west-1 cluster dropped to 0req/s.
  • 15:04 UTC 31st March: we were paged about the issue and began investigating.
  • 15:10 UTC 31st March: support ticket with DO reopened
  • 15:20 UTC 31st March: we're informed by a client that they have received ransom demands from a potential attacker, with no specific mention of an ongoing attack. Info is passed on to DO to confirm or deny based on network traffic.
  • 15:27 UTC 31st March: DO respond with logs demonstrating that the problem is an internal one with the LB which is experiencing timeouts on requests to internal servers. No further updates or resolutions provided.
  • 15:55 UTC 31st March: Incoming traffic resumes at normal levels. Based on the info we had, and knowing that DO were actively looking into the failed LB, we incorrectly assumed they had found and resolved the issue.
  • 1st April: Chased DO multiple times for a debrief. No useful response.
  • 11:02 UTC 2nd April: the inbound traffic to our eu-west-1 cluster dropped to 0req/s. We're now assuming this is related to the ransom demand received by the client.
  • 11:10 UTC 2nd April: we engage a secondary load balancer and switch all sites using CNAME records for their DNS to point to the new LB. Within a few minutes these sites are back online. The IP address is also published for A record DNS users to update to manually.
  • 11:20 UTC 2nd April: we begin moving the targeted client away from our primary infrastructure. This doesn't solve the issue for other impacted sites, but does demonstrate to the attackers that by directly attacking Servd's infra, they won't take down the intended website, hence, they're wasting money.
  • 13:33 UTC 2nd April: the attackers update their attack to include the new load balancer (perhaps assuming the target website might be using it?)
  • 13:36 UTC 2nd April: another emergency LB is spun up and CNAME DNS websites are pointed towards it. These sites suffer another ~10 mins downtime.
  • ~14:30 UTC 2nd April: the attack ends and all incoming traffic returns to normal
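
The difference between CNAME and A record sites during the 2nd April failover can be sketched with a toy resolver. This is only an illustration: every hostname and IP address below is a made-up placeholder, and real DNS adds TTL caching, which is why the switch took a few minutes rather than being instant.

```python
# Toy DNS data. All names and addresses are hypothetical placeholders.
records = {
    "lb.host.example": ("A", "192.0.2.10"),                  # name controlled by the host
    "www.client-one.example": ("CNAME", "lb.host.example"),  # client record that follows the host
    "client-two.example": ("A", "192.0.2.10"),               # client record pinned to an IP
}

def resolve(name, records, max_depth=8):
    """Follow CNAME records until an A record is reached."""
    for _ in range(max_depth):
        rtype, value = records[name]
        if rtype == "A":
            return value
        name = value
    raise RuntimeError("CNAME chain too long")

# The host swaps in a new load balancer by editing only its own record.
records["lb.host.example"] = ("A", "198.51.100.20")

cname_site_ip = resolve("www.client-one.example", records)
a_record_site_ip = resolve("client-two.example", records)

print(cname_site_ip)     # 198.51.100.20 (moved automatically)
print(a_record_site_ip)  # 192.0.2.10 (still the attacked IP until the client edits it)
```

The CNAME site moves as soon as the host updates its own record, while the A record site stays on the old address until its owner changes it by hand.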

What happens during a DDoS Attack?

There are several lines of defence against DDoS attacks which exist within Servd and have been utilised in the past.

Small Attacks

Small DoS attacks normally occur at L7 and attempt to overwhelm a site by forcing it to process a large number of HTTP requests. These can be mitigated by project-specific rate limits applied on a per-IP basis, which prevent the excess requests from reaching your application code.

Medium Attacks

Servd has a global rate limit which applies on a per-IP basis. Any individual IP address which performs too many requests against the entire infrastructure is throttled, preventing the requests from reaching client projects.
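
The two rate-limit tiers described above can be sketched as a pair of counters, one keyed by project and IP and one keyed by IP alone. This is not Servd's actual implementation: the fixed-window algorithm, the class and function names, and the thresholds are all assumptions picked to keep the example short.

```python
from collections import defaultdict

class FixedWindowLimiter:
    """Allow at most `limit` hits per key within each `window`-second bucket."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.counts = defaultdict(int)  # old buckets are never evicted in this sketch

    def allow(self, key, now):
        bucket = (key, int(now // self.window))
        if self.counts[bucket] >= self.limit:
            return False
        self.counts[bucket] += 1
        return True

# Hypothetical thresholds, chosen only for illustration.
project_limiter = FixedWindowLimiter(limit=3, window=60)  # per (project, IP)
global_limiter = FixedWindowLimiter(limit=5, window=60)   # per IP, across all projects

def handle(project, ip, now):
    """Return 'pass' if the request may reach application code, else the tier that blocked it."""
    if not global_limiter.allow(ip, now):
        return "blocked: global"
    if not project_limiter.allow((project, ip), now):
        return "blocked: project"
    return "pass"

# One IP sends four requests to site-a, then two to site-b.
results = [handle("site-a", "203.0.113.7", now=0) for _ in range(4)]
results += [handle("site-b", "203.0.113.7", now=1) for _ in range(2)]
print(results)
```

The fourth request to site-a is stopped by the project tier, and the second request to site-b is stopped by the global tier because the same IP has by then used up its allowance across the whole platform.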

Large Attacks

Digital Ocean's network attempts to scrub DDoS traffic at the network level and should also be effective in preventing large L7 attacks. Once this type of traffic is identified it is scrubbed and dropped within the DO network.

XLarge Attacks

Once an attack is large enough to potentially overwhelm a data centre's networking capacity, the targets of the attack (normally a set of IP addresses) are 'black-holed', which results in all traffic sent to them, legitimate or not, being dropped at the network level.
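
As a rough model of why black-holing also takes out legitimate visitors, the drop decision depends only on the destination address. The sketch below is an assumption-laden toy, not how any particular provider's routers are configured; the addresses come from documentation ranges and the function name is hypothetical.

```python
import ipaddress

# Hypothetical null-routed prefixes (documentation address range).
blackholed = [ipaddress.ip_network("192.0.2.10/32")]

def forward(dst_ip, blackholed):
    """Return False if a packet to dst_ip is dropped by a null route."""
    addr = ipaddress.ip_address(dst_ip)
    return not any(addr in net for net in blackholed)

packets = [
    {"src": "attacker", "dst": "192.0.2.10"},
    {"src": "real visitor", "dst": "192.0.2.10"},
    {"src": "real visitor", "dst": "192.0.2.11"},
]

# The source is never inspected, so the real visitor to 192.0.2.10 is dropped too.
delivered = [p for p in packets if forward(p["dst"], blackholed)]
print(delivered)
```

Only the packet addressed to the IP that is not null-routed gets through.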

How large was the attack on Servd?

It's difficult for us to say for certain as we don't have direct access to the network level stats, and DO still aren't talking to us about it (the support ticket is still open and unanswered as of today, 5th April). However, after discussing with the targeted client, whose other internet properties had also been impacted by the same attack, the total size of the attack was determined to be ~50Gbps.

For comparison, this is in the top 6% by size of all DDoS attacks as measured by Cloudflare.

Because of the large size of the attack, we fell into the XLarge Attack category as defined above, and Servd's targeted IP addresses were simply switched off by the data centre for the duration. Because this occurred within the data centre's network switches, it might also explain why Digital Ocean had trouble tracking down the root cause during the first two outages (although that's currently speculation).

Why was Servd targeted?

We were hosting a client which the attackers were attempting to ransom. They therefore decided to target us, as the client's host, along with all of the IP addresses related to our EU infrastructure.

Once the attackers were certain that attacking Servd itself was no longer having any impact on the target website, they moved on.

How are things looking now?

Everything has been operating normally since the attack ended. We do however need to talk to clients in order to rearrange DNS records which were updated during the attack. We currently have arbitrary projects with their DNS records pointing to different load balancers. We'd like to get that back into an organised state so that we can effectively implement changes and updates as required further down the line. We'll be reaching out to individual clients to discuss.

What can we do to prevent something similar happening in the future?

We're actively looking into several options.

  1. During the attack on 2nd April we had emergency calls with two cloud DDoS protection providers. Both quoted amounts which would require Servd to significantly increase its pricing in order to keep operating as it currently does (allowing clients to own their own DNS records) while also protecting against 50Gbps+ attacks. We'll be speaking to other providers to see if this is a viable option anywhere else.
  2. Private IP addresses and smaller IP pooling. This would involve splitting clients into smaller groups, with each group being assigned to an IP address. This would reduce the impact radius of an attack. We may also introduce private IPs and load balancers for individual projects as a plan add-on.
  3. Internal DNS management. Projects which were using CNAME records for their DNS experienced a much shorter downtime period as we were able to update the DNS records for those projects from our side. By bringing DNS management fully in-house we would be able to offer a similar problem mitigation speed, whilst avoiding the apex CNAME record problem. However, this is yet to be technically explored for viability.
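
To illustrate how the smaller IP pools in option 2 would shrink the impact radius, here is a minimal sketch that assigns projects to pools by hashing their names. The hashing scheme, the function names, the project names, and the pool counts are all assumptions for the example; an actual grouping could just as well be assigned manually or by plan.

```python
import hashlib

def pool_for(project, pool_count):
    """Map a project name to a stable pool index using a SHA-256 digest."""
    digest = hashlib.sha256(project.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % pool_count

def impacted(projects, target, pool_count):
    """Projects sharing a pool (and so an IP address) with the targeted project."""
    target_pool = pool_for(target, pool_count)
    return [p for p in projects if pool_for(p, pool_count) == target_pool]

# 1000 hypothetical projects, with project-42 as the attackers' target.
projects = [f"project-{i}" for i in range(1000)]
single_ip = impacted(projects, "project-42", pool_count=1)
ten_pools = impacted(projects, "project-42", pool_count=10)

print(len(single_ip))  # every project shares the attacked IP
print(len(ten_pools))  # roughly a tenth of them do
```

With one shared IP, a black-holed address takes every project offline. With ten pools, only the projects that happen to share the target's pool are affected, at the cost of managing more addresses and load balancers.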

Other things we're currently thinking about in response to the attack:

  • Potentially switching our underlying server provider away from DO. Some of their services have allowed Servd to offer all of its current functionality at a reasonable price-point, but their support response during this episode was very poor. Servd is also a Digital Ocean partner with an assigned account rep. When we reached out to that rep during the attack we learnt that all account reps for partners had been removed at some point in the last 6 months, and that the partner portal (where we'd normally go for further contact info) had also been removed, with no associated communication to partners.

    Any such switch in provider would be a multi-month process and the viability in terms of features and costs would also need to be analysed beforehand.

A personal note

As a company with a small team, we share the frustration and stress which events like this can create and we'll do our best to plan for any similar occurrences in the future. I apologise to any clients who have been impacted negatively by this attack. If you have any further questions or comments then you can drop me an email and I'd be happy to chat.

Matt