September 27, 2023


17 production-grade databases were deleted after a code typo caused a Microsoft Azure DevOps outage


On May 24, Microsoft Azure DevOps suffered an outage in a scale unit in its Brazil South region, resulting in roughly 10.5 hours of downtime.

Eric Mattingly, principal software engineering manager at Microsoft, has since apologized for the outage and explained its cause: a simple typo that led to the deletion of 17 production databases.

 

The background to the incident is that Azure DevOps engineers sometimes save snapshots of production databases to investigate reported issues or test performance improvements.

To ensure these snapshot databases are cleaned up, a dedicated background job runs daily and deletes snapshots older than a set retention period.
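As a minimal sketch of the retention check such a job might perform (the helper names, the record shape, and the 14-day window below are illustrative assumptions, not Microsoft's actual implementation):

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=14)  # assumed retention window, purely illustrative


def find_expired_snapshots(snapshots):
    """Return the snapshot records older than the retention window.

    `snapshots` is assumed to be an iterable of objects with a
    timezone-aware `created_at` datetime attribute.
    """
    cutoff = datetime.now(timezone.utc) - RETENTION
    return [snap for snap in snapshots if snap.created_at < cutoff]

# A daily job would then delete each expired snapshot database -- and,
# as the incident below shows, *which* delete call it issues matters.
```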

 

During Sprint 222, Azure DevOps engineers upgraded the code base, replacing the deprecated Microsoft.Azure.Management.* packages with the supported Azure.ResourceManager.* NuGet packages.

This produced a large number of pull requests swapping API calls in the old packages for their equivalents in the new ones.

Hidden in one of them was a typo in the snapshot deletion job: a call to delete the Azure SQL Database was replaced with a call to delete the Azure SQL Server that hosts it.
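To make the distinction concrete, here is a rough sketch of the two operations, shown with the Python azure-mgmt-sql client rather than the .NET Azure.ResourceManager packages the team actually migrated to; the subscription, resource group, server, and database names are placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.sql import SqlManagementClient

client = SqlManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Intended call: remove a single snapshot database.
client.databases.begin_delete(
    resource_group_name="rg-devops",     # illustrative names
    server_name="sql-scale-unit",
    database_name="snapshot-db-01",
).result()

# The typo'd call: one level up in the resource hierarchy, this deletes
# the whole Azure SQL Server -- and with it every database it hosts.
client.servers.begin_delete(
    resource_group_name="rg-devops",
    server_name="sql-scale-unit",
).result()
```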

 

According to Mattingly, the conditions under which this code runs are rare, so the existing tests did not cover it.

We deployed Sprint 222 to Ring 0 (our internal Azure DevOps organization) following our safe deployment practices (SDP); no snapshot databases existed there, so the job never ran the faulty code. A few days later we deployed to Ring 1, which includes the affected Brazil South scale unit. There, a snapshot database was old enough to trigger the bug, and when the job deleted the Azure SQL Server it also deleted all 17 production databases in the scale unit. From that point on, the scale unit could not handle any customer traffic.

 


Azure DevOps engineers detected the outage within 20 minutes of the deletions starting and began working on recovery. All data was eventually restored, but the process took roughly ten hours. Mattingly gave several reasons for this:

  • First, customers cannot recover an Azure SQL Server themselves; it must be recovered by the Azure SQL team. “Deciding that we needed an on-call Azure SQL engineer, getting them involved, and recovering the servers took about an hour.”
  • Second, the databases had different backup configurations: some were configured with zone-redundant backups and others with the newer geo-zone-redundant backups. Reconciling this mismatch added significant time to the recovery.
  • Finally, even after the databases began coming back online, customers whose data lived in them still could not reach the scale unit because of a complex set of problems with the web servers.

 

According to the write-up, these problems stemmed from the servers' warmup task, which walks the list of available databases and issues a test call against each.

Databases still in recovery caused the warmup test to “execute an exponential backoff retry, causing a warmup that normally takes less than 1 second to take an average of 90 minutes.”
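To see how retries can stretch a sub-second check into minutes, here is a minimal sketch of an exponential-backoff probe; the attempt count, base delay, and cap are illustrative assumptions, not Azure DevOps' actual values:

```python
import time


def probe_with_backoff(probe, max_attempts=10, base_delay=1.0, cap=300.0):
    """Call `probe()` until it succeeds, sleeping 1s, 2s, 4s, ... between tries.

    Against a database that is still recovering, every attempt fails, so the
    caller waits roughly sum(min(base_delay * 2**i, cap)) seconds per database
    instead of the usual sub-second check.
    """
    for attempt in range(max_attempts):
        try:
            return probe()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(min(base_delay * 2 ** attempt, cap))
```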

 

To complicate matters further, recovery was staggered, and as soon as one or two servers started accepting customer traffic again, they were overloaded and failed.

Ultimately, restoring service required blocking all traffic to the Brazil South scale unit until everything was ready, then rejoining the load balancer and serving traffic again.

 

Microsoft says it has implemented various fixes and reconfigurations to prevent the issue from recurring:

  • Fixed the bug in the snapshot deletion job.
  • Created a new test for the snapshot deletion job that exercises the full snapshot-database deletion scenario against real Azure resources.
  • Added Azure Resource Manager locks to critical resources to prevent accidental deletion (see the sketch after this list).
  • Ensured all Azure SQL Database backups are configured as geo-zone-redundant.
  • Ensured all future snapshot databases are created on an Azure SQL Server separate from the production databases.
  • The web server warmup task logic is being fixed so that servers start successfully even if a database is offline.
  • A new cmdlet for restoring deleted databases is being created to ensure the restore uses the same settings as before deletion (including backup redundancy).
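As a rough illustration of the resource-lock idea (not Microsoft's actual configuration), a CanNotDelete lock can be applied at a resource scope. This sketch assumes the azure-mgmt-resource package's ManagementLockClient, and the scope string and lock name are placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource.locks import ManagementLockClient
from azure.mgmt.resource.locks.models import ManagementLockObject

client = ManagementLockClient(DefaultAzureCredential(), "<subscription-id>")

# Illustrative scope pointing at a single Azure SQL Server resource.
scope = (
    "/subscriptions/<subscription-id>/resourceGroups/rg-devops"
    "/providers/Microsoft.Sql/servers/sql-scale-unit"
)

# A CanNotDelete lock blocks delete operations on the resource
# until the lock itself is explicitly removed.
client.management_locks.create_or_update_by_scope(
    scope,
    "protect-production-sql-server",
    ManagementLockObject(level="CanNotDelete",
                         notes="Guard against accidental deletion."),
)
```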

More details can be found in the official announcement.

 

 
