Joey on SQL Server

What Can Fix Microsoft's Support Challenges?

It's not just Azure, though Azure support is a big culprit. So what can Microsoft do to fix it, and why is the answer AI?

Microsoft Azure has been wildly successful. What started as a weird combination of Windows Azure and SQL Azure has evolved to become one of the world's top two hyperscale computing platforms. Today, there are over 50 Azure regions and thousands of service offerings, and there is an effectively infinite number of permutations of those services and regions.

As you can imagine, this is an incredibly complex stack of hardware and software. Having this much functionality in the platform means that sometimes customers have problems. And when they do, they open support tickets.

While the Azure service has made incredible strides in capability, uptime and performance, Microsoft support has headed in the opposite direction in the last decade.

A customer I know recently had a SQL Server memory dump, leading to an availability group failover. Traditionally, support quickly dealt with SQL Server memory dumps by running the crash dump through some debugging tools and providing the customer with feedback on whether it was a systemic or localized problem. Support would also state whether the product team would fix the bug in an upcoming update.

While this customer did not have an ongoing support contract with Microsoft, the support process ultimately disappointed them, offering no resolution.

That customer asked me if purchasing ongoing support would have helped them achieve a better outcome. On the morning I was replying to this e-mail, I was on both the Azure and sysadmin subreddits, and there were at least three or four posts from that week about the current state of Microsoft support and how generally terrible it is. The posts paint a consistent picture: outsourced support engineers who seemingly don't read the tickets and don't respect the customer's communication preferences (for instance, calling instead of e-mailing), and tickets that just get shuffled from group to group.

The only positive feedback came from larger customers with a dedicated support engineer from Microsoft, whose tickets nearly always bypass the first couple of levels of support.

In my day-to-day work, I've had very similar experiences with the Microsoft support process. Recently, as part of a case I had opened, I asked the engineer to schedule a meeting at a specific time, but he never replied, only sending me a meeting request 10 minutes before the proposed time.

I have also worked in support, so I know how challenging and pressure-packed those roles are. The support engineers are doing their best in a difficult situation. So how does Microsoft start to think about fixing this?

The first step in fixing any problem is to identify what the scope of the problem is. The biggest problem with the state of Microsoft support is the size and scope of both Microsoft's customer base and the Azure platform. This scope makes hiring and training enough engineers to support the platform exceedingly challenging.

I estimate that there are about 500,000 Azure customers, so let's use that number as a baseline. Assuming each customer creates 10 support cases a year, that's 5 million support cases you have to deal with. However, that's likely a gross underestimate, as widely used tools like SQL Server and Windows also have support cases pulling resources out of the same pool.
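To make that back-of-the-envelope math explicit, here is a minimal sketch. The customer count and cases-per-customer figures are the estimates above; the cases-per-engineer figure is purely a hypothetical placeholder I picked for illustration, not a number from Microsoft.

```python
def annual_cases(customers: int, cases_per_customer: int) -> int:
    """Total support cases per year for a given customer base."""
    return customers * cases_per_customer


def engineers_needed(cases: int, cases_per_engineer_per_year: int) -> int:
    """Engineers required to work the caseload, rounded up."""
    # Ceiling division without importing math
    return -(-cases // cases_per_engineer_per_year)


# Estimates from the text: 500,000 customers, 10 cases each per year
cases = annual_cases(500_000, 10)
print(cases)  # 5000000

# ASSUMPTION: 500 cases per engineer per year is a made-up workload
# used only to show how the staffing number scales.
print(engineers_needed(cases, 500))  # 10000
```

Change the hypothetical workload and the headcount moves with it, but the shape of the problem stays the same: staffing grows in direct proportion to the case count.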

While Microsoft could and should spend more money on support, I really think it is at the limit of what support engineers can do. It is especially hard with outsourced support, as the outsourcing firms are more focused on their own profits than on training engineers. Bringing some of those roles back in-house could help, but it would require major investment -- and it still might not deliver support improvements.

Imagine the staffing needed to support tens of millions of tickets annually. It simply doesn't scale with humans. This situation calls for a reevaluation of the support system and a collective effort to find a more effective solution.

So what is the solution? I hate to say this, because I feel that we, as an industry, have dramatically overhyped the benefits of AI solutions. However, well-designed AI with a properly designed knowledge base, and some automation tooling, could be incredibly useful at reducing the number of support tickets that engineers must deal with and manage.

Large language models (LLMs) are designed for text analysis and predictive text -- i.e., identifying problem sets and pointing toward other solutions. Azure already has a bit of this; there's some initial classification of your case, and it tries to point the user toward solutions in Microsoft docs.

However, the current systems are ineffective, as is the current implementation of Azure Copilot, which is helpful for a handful of tasks but not broadly useful for trained engineers. The potential of AI solutions in this context is promising and could significantly improve the support experience for Azure users. In many cases, this could be as simple as exposing limited, relevant amounts of Azure backend logging to help a customer understand why an outage or performance degradation happened.

AI is not a panacea. Many support cases could be mitigated just by ensuring that correct and meaningful error messages are shown to users. Improved QA would also help to provide a better feedback loop to engineering to build more robust services. (Of course, the rub is that QA has largely been replaced by automated unit testing.)

Supporting a platform as large and complex as Azure, along with the rest of the products Microsoft sells, is an incredibly difficult task. The sheer scale of customers, the number of services and the worldwide exposure all mean that a tremendous number of support engineers are needed around the globe. The current situation is challenging for both customers and support staff, and -- based on my personal experience, as well as reports on the Internet at large -- it's clear that good customer outcomes are not a guarantee.

Microsoft has made efforts to address the problems with support a couple of times in recent years, but it's clear that more needs to be done. AI can help, but it's not a replacement for making necessary investments in support and automation.

About the Author

Joseph D'Antoni is an Architect and SQL Server MVP with over two decades of experience working in both Fortune 500 and smaller firms. He holds a BS in Computer Information Systems from Louisiana Tech University and an MBA from North Carolina State University. He is a Microsoft Data Platform MVP and VMware vExpert. He is a frequent speaker at PASS Summit, Ignite, Code Camps, and SQL Saturday events around the world.
