Q&A
SQL Server 2019's Big Data Clusters Explained
The biggest feature in the SQL Server 2019 preview launched at Ignite is SQL Server Big Data clusters. Travis Wright, Microsoft's principal program manager for SQL Server, explains exactly what this means for administrators.
- By Joey D'Antoni
- 09/25/2018
Microsoft introduced a new community technology preview (CTP) of SQL Server 2019 at Microsoft Ignite on Monday (you can read about the full list of announced features here).
As part of that announcement came SQL Server Big Data clusters, a scale-out, data virtualization platform built on top of the Kubernetes (K8s) container platform. SQL Server Big Data clusters is a big investment from Microsoft into a number of technologies -- and it is clear that taking one of its best-selling enterprise products and building on top of the K8s infrastructure is a moonshot at modernizing the data estate in most enterprises.
Before Ignite, I had a chance to interview Travis Wright, principal program manager for SQL Server at Microsoft, on the new project.
SQL Server Big Data clusters introduces a lot of change to the platform in one fell swoop. While it's pretty revolutionary for commercial databases (elastic scale, in particular), do you foresee any gaps to adoption, and what are you doing to mitigate that?
Wright: Probably the most obvious adoption hindrance will be K8s/container adoption for database workloads. Companies are getting on the bandwagon, much as they did with virtualization. [Another hindrance is] open questions about containers themselves. But when people see how easy it is to deploy a SQL Server Availability Group into K8s, it becomes a no-brainer.
SQL Server 2017 was already impressive on K8s and did not receive that much attention, but the major limitations were running jobs, Active Directory authentication and availability groups. Are all of those fixed?
SQL Agent jobs are no problem -- the agent runs within the container. Active Directory authentication is still a work in progress, but it definitely will happen.
Availability groups -- this works and has been public for a while now. The first step is to deploy an operator into K8s. This operator orchestrates the deployment of the pods, then connects to all of them and orchestrates the full deployment of the availability group. In fact, this allows users to perform rolling upgrades (for cumulative updates) with a minimum of downtime. To provide quorum within the availability group, there will be a Paxos implementation, with its configuration stored in a K8s config map as part of the deployment -- architecturally similar to a failover cluster instance. This entire solution will be released as an open source project by Microsoft.
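The operator and its manifests live on the K8s side rather than in T-SQL, but once the operator has stood the availability group up, its health can be verified from any replica with the standard DMVs. A minimal sketch (the group and database names are whatever you deployed):

    -- Check replica roles and synchronization health for every
    -- availability group visible from this instance.
    SELECT ag.name AS availability_group,
           ar.replica_server_name,
           ars.role_desc,
           ars.synchronization_health_desc
    FROM sys.availability_groups ag
    JOIN sys.availability_replicas ar
        ON ag.group_id = ar.group_id
    JOIN sys.dm_hadr_availability_replica_states ars
        ON ar.replica_id = ars.replica_id;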
Can you explain how SQL Server Big Data clusters increases flexibility and reduces time to value from data for customers?
Deploying SQL Server alone can be time-consuming. If you have a K8s infrastructure, you can have a complete availability group infrastructure in, like, five minutes. Big Data -- it's a hassle to deploy and manage a Big Data cluster. With SQL Server Big Data clusters, you can deploy a cluster with a single command and have it running in about 30 minutes.
Integrating your relational environment with your Big Data environment has security hassles. With SQL Server Big Data clusters, there is an integrated security tier. This covers SQL Server, Spark and Hadoop Distributed File System [HDFS].
Data virtualization can easily integrate data without having to do ETL [extract, transform, load]. A few key points here:
- How long does it take to go from an executive asking you for a dashboard to actually delivering the report? Being able to directly query tables in other systems through external tables greatly reduces that time (see the sketch after this list).
- What is the latency of your pipeline -- how old is your data by the time it lands? Driving that latency down requires options like change tracking, which greatly increase the complexity of your data ingest processes; querying the source directly sidesteps that overhead.
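To make the external-table point concrete, here is a minimal T-SQL sketch of querying a remote SQL Server through PolyBase data virtualization, assuming PolyBase is installed and enabled -- the server, credential and table names are all hypothetical:

    -- A database master key must exist before a scoped credential can be created.
    CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'placeholder-Str0ng!Pass';

    CREATE DATABASE SCOPED CREDENTIAL SalesCred
        WITH IDENTITY = 'report_reader', SECRET = 'placeholder-secret';

    -- Point at the remote SQL Server that owns the data.
    CREATE EXTERNAL DATA SOURCE SalesSource
        WITH (LOCATION = 'sqlserver://sales-db.contoso.com',
              CREDENTIAL = SalesCred);

    -- Expose the remote table locally; no data is copied.
    CREATE EXTERNAL TABLE dbo.Orders
    (
        OrderID   INT,
        Amount    DECIMAL(10, 2),
        OrderDate DATE
    )
    WITH (LOCATION = 'SalesDB.dbo.Orders', DATA_SOURCE = SalesSource);

    -- The dashboard query runs against live data, with no ETL pipeline.
    SELECT OrderDate, SUM(Amount) AS Total
    FROM dbo.Orders
    GROUP BY OrderDate;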
How are we scheduling jobs against SQL Server Big Data clusters?
So, there are three kinds: Spark jobs, SQL jobs and control plane jobs.
There is a control plane job execution engine, brought over from Azure into SQL Server Big Data clusters, which runs cluster management jobs that are system- or user-initiated. For example, this could be creating a compute pool.
SQL jobs will run in the SQL master instance just like normal agent jobs. Spark jobs -- there isn't a native scheduler for Spark, so Microsoft will be building a scheduler service for Spark.
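Those SQL jobs are plain SQL Server Agent jobs, defined through msdb as usual; a minimal sketch (the job, schedule and database names are hypothetical):

    USE msdb;
    -- Create the job and a single T-SQL step.
    EXEC dbo.sp_add_job @job_name = N'NightlyStatsRefresh';
    EXEC dbo.sp_add_jobstep
        @job_name      = N'NightlyStatsRefresh',
        @step_name     = N'Update statistics',
        @subsystem     = N'TSQL',
        @database_name = N'SalesDB',
        @command       = N'EXEC sp_updatestats;';
    -- Run daily at 02:00.
    EXEC dbo.sp_add_schedule
        @schedule_name     = N'Daily2am',
        @freq_type         = 4,      -- daily
        @freq_interval     = 1,
        @active_start_time = 020000;
    EXEC dbo.sp_attach_schedule
        @job_name = N'NightlyStatsRefresh',
        @schedule_name = N'Daily2am';
    -- Target the local server so Agent will actually pick the job up.
    EXEC dbo.sp_add_jobserver @job_name = N'NightlyStatsRefresh';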
What tools are we using to manage the cluster?
Azure Data Studio is the tool that data engineers, data scientists and DBAs will use to manage databases and write queries. Cluster admins will use the admin portal, which runs as a pod inside the same namespace as the rest of the cluster and provides information like the status of all pods and overall storage capacity.
We have a pipeline of monitoring tools -- Telegraf, InfluxDB and Grafana -- that collects SQL Server and Spark monitoring data, as well as metrics at the K8s node level.
There will also be a command-line tool that is the equivalent of the admin portal; it interacts with the REST API to automate different tasks -- e.g., creating a compute pool every morning.
Spark users will have access to the Spark UI, with special extensions to show what is going on in job management.
What are your thoughts on the best way to load data into databases on the cluster?
SQL is just an endpoint to connect to. All existing data movement tools that connect to SQL Server will still work.
There are two new options. First, you can load data into HDFS via the HDFS client or Spark streaming; this is a good model for data exchange between companies. Second, you can use Spark to create data-wrangling jobs with the PROSE library to normalize data, and then write the results to SQL Server using the SQL Server connector for Spark.
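On the SQL side, files landed in HDFS can then be exposed as an external table through PolyBase; a minimal sketch, with a hypothetical namenode address, path and schema:

    -- Register the HDFS endpoint and a delimited-text file format.
    CREATE EXTERNAL DATA SOURCE HdfsLanding
        WITH (TYPE = HADOOP, LOCATION = 'hdfs://namenode:8020');

    CREATE EXTERNAL FILE FORMAT CsvFormat
        WITH (FORMAT_TYPE = DELIMITEDTEXT,
              FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

    -- Every file under /landing/partner_feed/ now queries as one table.
    CREATE EXTERNAL TABLE dbo.PartnerFeed
    (
        CustomerID INT,
        Amount     DECIMAL(10, 2)
    )
    WITH (LOCATION = '/landing/partner_feed/',
          DATA_SOURCE = HdfsLanding,
          FILE_FORMAT = CsvFormat);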
You can also use external tables over other data sources -- which is the real power of data virtualization here -- and load that data into compute pools using "create table as select."
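The exact "create table as select" surface for compute pools wasn't public at CTP time; on a standalone SQL Server today, the equivalent materialization of an external table's rows uses SELECT ... INTO. A sketch reusing the hypothetical dbo.Orders external table from earlier:

    -- Pull the interesting slice of the remote data into a local table.
    SELECT OrderID, Amount, OrderDate
    INTO dbo.RecentOrders
    FROM dbo.Orders
    WHERE OrderDate >= '2018-01-01';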
Will SQL Server Big Data clusters support other K8s services beyond Azure (e.g., Google Cloud Platform or Amazon Web Services)?
Azure Kubernetes Service (AKS) and Azure Container Service (ACS) first, and then OpenShift on Azure. Are enough variables exposed to make it deployable into other managed services? The configuration is specified by type, so we are leaving those options open.
One thing that is sometimes overlooked: not only is SQL Server Big Data clusters doing data virtualization and Big Data, but we are also providing a managed Database as a Service (DBaaS) platform, which provides HA [high availability], monitoring, and backup and recovery.
What about standard DBA tasks like backup and recovery and index maintenance?
For backups, at deployment time you'll set the destination and retention, and it will likely run as a control plane job. Over time we'll add indexing, stats and consistency checks to that set of jobs.
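Whatever form the control plane job ultimately takes, what it executes underneath is ordinary T-SQL; a minimal sketch with a hypothetical database name and a Linux-style container path:

    -- Full backup with integrity checksums and compression.
    BACKUP DATABASE SalesDB
        TO DISK = N'/var/opt/mssql/backups/SalesDB.bak'
        WITH CHECKSUM, COMPRESSION;

    -- The matching restore.
    RESTORE DATABASE SalesDB
        FROM DISK = N'/var/opt/mssql/backups/SalesDB.bak'
        WITH REPLACE;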
You can deploy it anywhere there is K8s, so you get consistency in how you deploy the solution -- anywhere, on any cloud. A lot of what SQL Server Big Data clusters does is bring cloud concepts outside of the public cloud, or at least make them consistent across all of the clouds and your own infrastructure. Management services are built into the product.
One other thing to realize is that Machine Learning (ML) Services are running in the master instance. R, Python and Java, along with model storage, scoring and native prediction, are all available natively in the master instance.
Also, you have Spark ML in the box, so SQL Server Big Data clusters can take advantage of Spark for ML model training and scoring. We're working on tying the two together: produce a model using Spark, convert it to an MLeap model, load that into SQL Server and use the Java extensibility to score that model in SQL Server.
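The native-prediction piece Wright mentions already has T-SQL surface area in the form of PREDICT, which scores a serialized model in-engine without invoking an external runtime. A minimal sketch, assuming a model previously trained with the RevoScaleR/revoscalepy libraries and stored in a table (the table, column and model names are hypothetical; the MLeap/Java path he describes is separate):

    -- Fetch the serialized model and score a batch of rows in-engine.
    DECLARE @model VARBINARY(MAX) =
        (SELECT model_bits
         FROM dbo.Models
         WHERE model_name = N'churn_v1');

    SELECT d.CustomerID, p.churn_probability
    FROM PREDICT(MODEL = @model, DATA = dbo.NewCustomers AS d)
    WITH (churn_probability FLOAT) AS p;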
About the Author
Joseph D'Antoni is an architect with over two decades of experience working in both Fortune 500 and smaller firms. He holds a BS in Computer Information Systems from Louisiana Tech University and an MBA from North Carolina State University. He is a Microsoft Data Platform MVP and VMware vExpert, and a frequent speaker at PASS Summit, Ignite, Code Camps, and SQL Saturday events around the world.