Incident Report: Major March 2024 Outage #

Pierre Le Fevre, 2024-03-22

In March 2024, kthcloud experienced a significant outage that disrupted our services for several weeks. This detailed report outlines the incident, including the timeline, root cause, and the steps we’re taking to ensure this does not happen again.

[Image: servers on fire]
Obligatory AI-generated image


Preface #

To our valued users, we extend our deepest apologies for the inconvenience and challenges brought about by the recent outage. Your trust in our service is something we deeply value and we understand the impact this disruption has had on your work and projects. We are grateful for your patience, understanding and support during this time, and we appreciate the feedback we have received.

kthcloud remains committed to providing a reliable service, and we are taking this incident very seriously. We are working to implement changes to prevent similar incidents in the future, and we will keep you updated on our progress.

What happened #

The outage began on March 1st with a failed upgrade of CloudStack, our underlying IaaS platform, which led to a series of cascading issues affecting all services running on kthcloud.

One week prior to the CloudStack upgrade, on February 23rd, we had completed a database migration to a new storage system, and the system was fully operational by the end of the day. Normally, when upgrading CloudStack, the database schema is automatically patched to the new version. However, due to an issue introduced by the recent database migration, the schema was not patched correctly, which caused the CloudStack management server to fail. At this point, running services were still operational, and only the creation of new VMs was affected.
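
In hindsight, a post-upgrade check of the schema version would have caught the mismatch before the management server was started. Below is a minimal sketch of such a check, assuming CloudStack's usual `cloud` database and its `version` table, and using the `pymysql` driver; the column layout and target version string are assumptions and may differ between releases.

```python
# Sketch: verify that the CloudStack database schema reached the expected
# version after an upgrade, before starting the management server.
# Assumes the standard "cloud" database and its "version" table; treat the
# column layout and target version as illustrative.
import pymysql

EXPECTED_VERSION = "4.19.0.0"  # hypothetical target version

conn = pymysql.connect(host="localhost", user="cloud", password="...", database="cloud")
try:
    with conn.cursor() as cur:
        cur.execute("SELECT version, updated FROM version ORDER BY id DESC LIMIT 1")
        version, updated = cur.fetchone()
        if version != EXPECTED_VERSION:
            raise SystemExit(
                f"Schema is at {version} (updated {updated}), expected {EXPECTED_VERSION}; "
                "do not start the management server yet."
            )
        print(f"Schema OK: {version}")
finally:
    conn.close()
```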

In an attempt to fix the issue, we decided to roll back the database to a snapshot taken before the CloudStack upgrade. However, because local configuration on the management server had changed in the meantime, the rollback failed and the database was left in an inconsistent state.
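
One takeaway is that a database snapshot on its own is not a usable rollback point; the management server's local configuration has to be captured at the same moment. The sketch below outlines the kind of combined pre-upgrade snapshot we want to standardize on. The database names and config path are assumptions based on a default CloudStack install, and MySQL credentials are expected to come from an option file.

```python
# Sketch: capture the CloudStack databases and the management server's local
# configuration together, so a rollback restores a consistent pair.
# Database names and the config path are assumptions; adjust to your setup.
import subprocess
from datetime import datetime
from pathlib import Path

STAMP = datetime.now().strftime("%Y%m%d-%H%M%S")
DEST = Path(f"/backup/pre-upgrade-{STAMP}")
DEST.mkdir(parents=True, exist_ok=True)

# Consistent dump of the CloudStack databases (credentials via ~/.my.cnf or similar).
with open(DEST / "cloud.sql", "wb") as out:
    subprocess.run(
        ["mysqldump", "--single-transaction", "--databases", "cloud", "cloud_usage"],
        stdout=out,
        check=True,
    )

# Archive the management server's local configuration (path assumed).
subprocess.run(
    ["tar", "-czf", str(DEST / "management-config.tar.gz"), "/etc/cloudstack/management"],
    check=True,
)

print(f"Snapshot written to {DEST}")
```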

Friday night was fast approaching, and we decided to leave the database in its current state, as only new VM creation was affected. We were expecting to resolve the issue on Monday morning.

Narrator: little did they know, things were about to get a whole lot worse

[Image: Discord chat log warning that the server room is too hot]
Figure 1, kthcloud Monday morning chat log

On Monday, March 4th, we arrived to find that most servers in Flemingsberg had shut down.

The joke that our Flemingsberg server room was actually a sauna came true: the air conditioning had failed over the weekend. This halted any attempt to restore services, as the servers were shutting down automatically to prevent permanent damage.

We reported the issue immediately and were provided a temporary cooling solution in the days that followed. This allowed us to resume work on restoring CloudStack, which progressed slowly due to the compounding issues.

On March 15th, we were able to identify the root cause, a corrupted encryption key, and fix it using a backup. We then began restoring services, which took several days due to the complexity of the infrastructure.
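
Going forward, we want restoring the key from backup to be a verified step rather than guesswork. Here is a minimal sketch that compares the live key file against the backed-up copy before overwriting it; the key file path is an assumption based on a default CloudStack management server install.

```python
# Sketch: check whether the management server's encryption key matches the
# backed-up copy, and restore it if it does not. The live key path is an
# assumption based on a default CloudStack install.
import hashlib
import shutil
from pathlib import Path

LIVE_KEY = Path("/etc/cloudstack/management/key")
BACKUP_KEY = Path("/backup/cloudstack/key")

def digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

if digest(LIVE_KEY) == digest(BACKUP_KEY):
    print("Key matches backup; nothing to do.")
else:
    shutil.copy2(BACKUP_KEY, LIVE_KEY)  # preserves permissions and timestamps
    print("Key differed from backup; restored it (restart the management server).")
```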

What was impacted #

[Image: diagram of kthcloud showing that everything depends on CloudStack]
Figure 2, kthcloud infrastructure diagram

As all of our workloads run on top of CloudStack, its failure had a significant impact on our services (see fig. 2). All services running on kthcloud, in both Flemingsberg and Kista, were affected by the outage. This was particularly felt by users relying on our infrastructure for their day-to-day operations.

Timeline #

To better understand the sequence of events, we’ve visualized the key milestones of the outage:

| Time | Description |
| --- | --- |
| 2024-02-23 | CloudStack MySQL database moved to a new storage system; some issues during the migration, but all resolved by the end of the day |
| 2024-03-01 | CloudStack upgrade to 4.19 failed due to database issues; rollback not possible because of local configuration changes |
| 2024-03-04 | All services shut down due to AC failure in Flemingsberg |
| 2024-03-15 | Root cause identified, fix implemented, restoration of services begun |
| 2024-03-22 | All critical services restored |

Takeaways #

The outage has been a profound learning experience for us, highlighting several areas for improvement:

Staging environment #

A staging environment is being built to test large upgrades and catch issues before they reach production. Previously, we used staging environments for individual services but not for the entire infrastructure, due to time and resource constraints. This outage has shown the importance of a staging environment covering the entire infrastructure, and we are working to implement one.

Redundant services across zones #

When the Kista zone was added, we decided to run most of our core infrastructure only in Flemingsberg, in order to keep complexity low. We knew this increased the risk of a single point of failure, but accepted the trade-off. This outage has shown that the risk was not acceptable after all, and we are working to spread services across multiple zones.

Our DNS servers have already been replicated in Kista, and we are working to replicate more critical services.

Spreading services across multiple cooling systems #

As we have been testing the limits of our cooling system for a while, we think the best solution is to spread out services across multiple cooling systems. This will also help with redundancy, as a failure in one cooling system will not affect all services.

The original AC was repaired, and we are able to keep the temporary solution in place for the foreseeable future, so there is no imminent risk of another failure. As the move to multiple cooling systems is a large project, we are working on a plan to implement this, and will keep you updated on our progress.

Next steps #

While all critical services have been brought back online, some services like our hosted Stable Diffusion and Llama 2 instances are still offline. Work is ongoing to restore them.

We are working on a large upgrade to the way workloads are run on kthcloud, which will allow us to remove CloudStack entirely and run workloads directly on Kubernetes. This will simplify our infrastructure, making it more manageable for our lean team. Look out for VM v2 announcements in the coming months.
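
To give a rough idea of what running VMs directly on Kubernetes can look like, here is a sketch that creates a VM as a Kubernetes custom resource using KubeVirt and the official Python client. KubeVirt, the resource shape, and the disk image are purely illustrative assumptions; the actual VM v2 design is not final.

```python
# Sketch: create a VM as a Kubernetes custom resource via KubeVirt.
# KubeVirt and the image below are illustrative choices, not a confirmed design.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

vm = {
    "apiVersion": "kubevirt.io/v1",
    "kind": "VirtualMachine",
    "metadata": {"name": "demo-vm"},
    "spec": {
        "running": True,
        "template": {
            "spec": {
                "domain": {
                    "devices": {"disks": [{"name": "rootdisk", "disk": {"bus": "virtio"}}]},
                    "resources": {"requests": {"memory": "1Gi", "cpu": "1"}},
                },
                "volumes": [
                    {
                        "name": "rootdisk",
                        # Illustrative container disk image.
                        "containerDisk": {"image": "quay.io/containerdisks/ubuntu:22.04"},
                    }
                ],
            }
        },
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubevirt.io",
    version="v1",
    namespace="default",
    plural="virtualmachines",
    body=vm,
)
print("VirtualMachine demo-vm created")
```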

In Conclusion #

We are committed to transparency and accountability as we move forward from this incident. The lessons learned are being applied to every level of our operations, with the goal of preventing future outages and maintaining the trust you place in us.

If you have any questions, concerns, or feedback, please don’t hesitate to reach out on Discord. Your input is invaluable to us as we work to make kthcloud a more resilient and reliable service for you.

As a student-run cloud, we are always looking for new members to join our team. If you found this report interesting and want to help us prevent future outages, please reach out to us on Discord.

Thank you for your continued support.