Joey on SQL Server

Migrating a Major SQL Server Workload to Azure: The Good, Bad and Ugly

You're going to need a bigger network connection.

My last couple of months have been filled with a major migration to Microsoft Azure for a client, who has both a very large data warehouse and an online transaction processing (OLTP) Web site. In all, we moved around 50TB of SQL Server data into Azure VMs.

The best metaphor for this project came from a manager, who said, "It's just like moving houses -- you find stuff you didn't know you had and at the end, you get to pay two mortgages." (It should be noted that our financial clock was ticking because the original servers were in a colocation facility, where they were accruing charges every month.)

This migration faced a number of challenges, the biggest being that the source servers had 1Gbps network cards, which capped how quickly we could push data to Azure. We also consolidated some servers to reduce costs.

The First Migration...and the Freak-Out
When we started the migration of the OLTP system, we ran in parallel testing for a few weeks and verified that all jobs and connectivity were working.

Unfortunately, I wasn't as comfortable as I would have liked with that load testing. The servers in the current colocation facility were very powerful and, more importantly, had really good solid-state drive (SSD) storage. Azure Premium Storage is fast, but it does not typically match the latency of Fibre Channel-attached SSDs. During this testing, I fielded a couple of complaints about slow queries in the Azure environment, which were remedied by some index tuning.

The migration used log shipping and the transition happened cleanly. For the application tier, we used Azure Traffic Manager, which acts as a DNS-based global load balancer, allowing traffic to flow to Web servers in multiple locations. Web traffic was pointed at both the colo and Azure seamlessly for the migration.
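For those who haven't used Traffic Manager, the setup is roughly a weighted profile with one endpoint per Web tier. Here's a minimal sketch using the Az.TrafficManager PowerShell module; the resource group, DNS names, targets and weights are made up for illustration, not the client's actual configuration:

# Create one weighted profile that fronts both Web tiers (names are hypothetical).
$tmProfile = New-AzTrafficManagerProfile -Name 'web-migration' `
    -ResourceGroupName 'rg-migration' -TrafficRoutingMethod Weighted `
    -RelativeDnsName 'contoso-web-migration' -Ttl 30 `
    -MonitorProtocol HTTPS -MonitorPort 443 -MonitorPath '/'

# One endpoint for the existing colo Web farm, one for the new Azure Web tier.
New-AzTrafficManagerEndpoint -Name 'colo' -ProfileName $tmProfile.Name `
    -ResourceGroupName 'rg-migration' -Type ExternalEndpoints `
    -Target 'web.colo.example.com' -EndpointStatus Enabled -Weight 50
New-AzTrafficManagerEndpoint -Name 'azure' -ProfileName $tmProfile.Name `
    -ResourceGroupName 'rg-migration' -Type ExternalEndpoints `
    -Target 'web-azure.example.com' -EndpointStatus Enabled -Weight 50

Shifting the weights (or disabling an endpoint) is then all it takes to move traffic between the colo and Azure during the cutover.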

As soon as we cut over and started seeing load, server response time became horrendous. I identified a missing index and this improved the I/O performance overall, but once that bottleneck was removed, more appeared behind it. We attributed the performance problems to two issues: To mimic the original environment, we had decided to use failover cluster instances with Cluster Shared Volumes on Windows Server 2016, and we had decided to use a virtual machine (VM) series that exposes hyperthreads rather than physical cores.
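For context, the missing-index check I ran is roughly the query below against the missing-index DMVs, here wrapped in dbatools' Invoke-DbaQuery (the instance name is made up, and the DMV suggestions are hints to evaluate, not indexes to create blindly):

$query = @"
SELECT TOP (20)
       mid.statement AS [table],
       mid.equality_columns, mid.inequality_columns, mid.included_columns,
       migs.user_seeks, migs.avg_total_user_cost, migs.avg_user_impact
FROM sys.dm_db_missing_index_details AS mid
JOIN sys.dm_db_missing_index_groups AS mig
  ON mig.index_handle = mid.index_handle
JOIN sys.dm_db_missing_index_group_stats AS migs
  ON migs.group_handle = mig.index_group_handle
ORDER BY migs.user_seeks * migs.avg_total_user_cost * (migs.avg_user_impact / 100.0) DESC;
"@
# These DMVs reset when the instance restarts, so run this only after real load has hit the server.
Invoke-DbaQuery -SqlInstance 'azsqlvm01' -Query $query | Format-Table -AutoSize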

There was no direct metric that showed issues with clustering, but I/O latencies dropped dramatically when the second node was powered off (which you shouldn't do -- it isn't just a high-availability concern, running that way is unsupported with Cluster Shared Volumes). Shifting from threaded VMs (VMs using hyperthreading) to core-based VMs -- from E-series to DSv2 -- provided the biggest improvement: Storage throughput increased dramatically, which we noticed immediately when a test restore finished in half the time it had taken in the older environment. The whole migration was on the verge of failure before we elected to change the VM sizes and remove clustering from the solution.

The performance issues highlighted the fact that the OLTP workload had some opportunities for enhancement, but in general, once we switched VM sizes (and in doing so, reduced core count and memory), performance issues were minimal.

Heavy Lifting: Moving 50TB to the Cloud
This was where having a bigger network connection would have been really handy. The data warehouse servers are big. Each one has about 15TB of data, and there are three of them.

There was an extensive testing process that encountered the same performance issues with the chosen threaded VMs, so we switched early to core-based VMs (in this case, the Azure G-Series). The testing process lasted until the Wednesday before the migration, which started on Friday, so I didn't have a lot of time to stage data. There was limited time to set up mirroring, log shipping or distributed availability groups for all of the databases. I was able to get mirroring configured for a few of the largest databases, however, which was extremely helpful.
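Staging a database that way looks roughly like the sketch below: restore a recent full backup on the Azure instance with NORECOVERY, then set the partners on each side. The instance, database and endpoint names are made up, and it assumes the mirroring endpoints already exist on both instances and that a recent log backup has also been restored with NORECOVERY:

# Seed the future mirror in Azure from a full backup, leaving it in the RESTORING state.
Restore-DbaDatabase -SqlInstance 'azdw01' -DatabaseName 'SalesDW' `
    -Path '\\azdw01\staging\SalesDW_full.bak' -NoRecovery

# Point the mirror at the principal first, then the principal at the mirror.
Invoke-DbaQuery -SqlInstance 'azdw01' -Query `
    "ALTER DATABASE SalesDW SET PARTNER = 'TCP://colodw01.contoso.com:5022';"
Invoke-DbaQuery -SqlInstance 'colodw01' -Query `
    "ALTER DATABASE SalesDW SET PARTNER = 'TCP://azdw01.contoso.com:5022';"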

The good news was that for this migration, there was a two-day outage window, which meant we could use backup and restore as a migration method. The challenge with that was the sheer data volume and the limited pipe to the cloud. To mitigate that, I did a mix of copying existing backup files up to Azure, writing backups from on-premises directly to Azure Blob Storage, and sending some via carrier pigeon. Full backups started moving on Thursday morning and finally finished Saturday morning.
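Writing a backup from on-premises straight to Azure Blob Storage looks roughly like this with dbatools (the instance, database, storage account and container are made up, and it assumes a shared access signature-based SQL Server credential named after the container URL already exists on the source instance, which is how backup to URL authenticates):

# Compressed full backup written directly to a blob container over the WAN.
Backup-DbaDatabase -SqlInstance 'colodw01' -Database 'SalesDW' `
    -AzureBaseUrl 'https://migrationsa.blob.core.windows.net/backups' `
    -Type Full -CompressBackup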

After the restores were completed, we created a couple of Always On Availability Groups in Azure, so I had to do another set of backups and restores (and some Availability Group automatic seeding). Fortunately, this activity all took place in Azure, with better network throughput.
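The moving parts of automatic seeding look roughly like the T-SQL below, again run through dbatools (the availability group, replica and database names are made up):

# On the primary replica: switch the secondary to automatic seeding, then add the database.
Invoke-DbaQuery -SqlInstance 'azdw01' -Query `
    "ALTER AVAILABILITY GROUP [DW-AG] MODIFY REPLICA ON 'AZDW02' WITH (SEEDING_MODE = AUTOMATIC);"
Invoke-DbaQuery -SqlInstance 'azdw01' -Query `
    "ALTER AVAILABILITY GROUP [DW-AG] ADD DATABASE [SalesDW];"

# On the secondary replica: let the availability group create the seeded database.
Invoke-DbaQuery -SqlInstance 'azdw02' -Query `
    "ALTER AVAILABILITY GROUP [DW-AG] GRANT CREATE ANY DATABASE;"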

The other toolset I used and want to mention is an open source project called dbatools. It includes a number of PowerShell cmdlets that make backups and restores much easier at scale, as well as deploying jobs and logins across multiple servers. This greatly reduced the amount of work needed to perform point-in-time restores and configure new servers.
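To give a flavor of what that looks like, here are the kinds of calls I mean; the instance names, paths and restore time are made up for illustration:

# Point-in-time restore: given a folder of full, differential and log backups,
# Restore-DbaDatabase reads the backup headers and builds the whole restore sequence.
Restore-DbaDatabase -SqlInstance 'azdw01' -Path '\\azdw01\staging\SalesDW' `
    -DatabaseName 'SalesDW' -RestoreTime (Get-Date '2018-05-12 06:00') -WithReplace

# Carry logins (with their SIDs and password hashes) and Agent jobs over to the new servers.
Copy-DbaLogin    -Source 'colodw01' -Destination 'azdw01' -ExcludeSystemLogins
Copy-DbaAgentJob -Source 'colodw01' -Destination 'azdw01'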

After go-live, we had a couple of minor errors, but otherwise the migration was seamless -- just the way it should be.

Final Learnings
You can never test enough for a migration like this, but in most cases, you won't have enough time to perform the testing you really need to do.

Also, performance testing is hard. The commercial tools for load testing are expensive and can be difficult to work with. It is possible to perform zero-downtime migrations when data volumes are small, but as your data volume increases, doing so becomes much more expensive and challenging. The techniques are the same; the plumbing just needs to be way bigger.

About the Author

Joseph D'Antoni is an Architect and SQL Server MVP with over a decade of experience working in both Fortune 500 and smaller firms. He is currently Principal Consultant for Denny Cherry and Associates Consulting. He holds a BS in Computer Information Systems from Louisiana Tech University and an MBA from North Carolina State University. Joey is the co-president of the Philadelphia SQL Server Users Group. He is a frequent speaker at PASS Summit, TechEd, Code Camps, and SQLSaturday events.
