Microsoft Looks to Yukon for Data Mining Gold

Of the dozens of feature sets that Microsoft added or improved since its last SQL Server release, one area that received a particularly significant overhaul is data mining. So much so that Microsoft executives contend data mining could go mainstream when SQL Server 2005 ("Yukon") ships in the second half of this year.

Jamie MacLennan, Microsoft's data mining development lead for SQL Server, describes three pieces of a puzzle that will make Yukon an "accelerating factor" for data mining:

  • The bundling of new business intelligence, data warehousing and other database technologies into the core database at no extra cost will lead to broad deployment of the technology, although it won't guarantee use.
  • Microsoft's focus on ease of use and integration with developer tools (Visual Studio 2005 is to ship simultaneously with SQL Server 2005) should spur usage.
  • The low cost compared to traditional data mining tools will leave customers with money to invest in third-party tools or services to get their data mining projects off the ground.

"A huge number of customers will have data mining functionality licensed in their enterprises," MacLennan says. "Before, people had to do a million-plus dollar investment in data mining tools." That left little money for customers to spend on third-party consulting firms to help with their implementations.

Microsoft points to the OLAP database world as an example of what could happen.

"Before SQL Server 7.0, OLAP was a niche technology with high-end consultants and expensive tools. Now there are actually more consultants, but you also have more IT shops doing it themselves. One major leg of the cost is taken away," he explains.

If some of this sounds familiar, it is. Five years ago, Microsoft had similar hopes of spurring mainstream adoption of data mining. It included mining capabilities with the OLAP engine in SQL Server 2000 as part of a business intelligence package called Analysis Services.

A major difference with Yukon, according to MacLennan, is time. With SQL Server 2000, Microsoft decided to add data mining functionality late in the product cycle. "In Yukon, now we've had a long product cycle to develop a robust feature set."

Data mining has been around for a long time, but it's still a somewhat mysterious and little-used art. The idea is to take a huge set of data and run mathematical algorithms against it to find hidden patterns and relationships.

At its root, data mining involves statisticians working with existing data sets to build models that can then be used within real applications to find correlations or predict events. Applications that benefit from data mining algorithms include credit checks, airplane engine failure prediction and oil and gas exploration.
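The train-then-predict workflow described above can be sketched in a few lines of Python. Everything here is illustrative: the toy credit records and the single-split "decision stump" learner are invented for the example, and bear no relation to the far richer algorithms in SQL Server itself.

```python
# Illustrative sketch of the data mining workflow: learn a pattern from
# historical records, then use it to score new cases. The toy credit data
# and the single-threshold "decision stump" model are hypothetical.

def train_stump(records):
    """Find the income threshold that best separates good from bad risk."""
    best_threshold, best_correct = None, -1
    for r in records:
        threshold = r["income"]
        correct = sum(
            1 for x in records
            if (x["income"] >= threshold) == (x["risk"] == "good")
        )
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold

def predict(threshold, income):
    """Score a new applicant against the learned threshold."""
    return "good" if income >= threshold else "bad"

history = [
    {"income": 20, "risk": "bad"},
    {"income": 35, "risk": "bad"},
    {"income": 60, "risk": "good"},
    {"income": 80, "risk": "good"},
]
t = train_stump(history)
print(predict(t, 75))  # a high-income applicant scores as "good"
```

The point is the division of labor MacLennan alludes to: a specialist builds the model once from historical data, and an ordinary application then calls the cheap `predict` step.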

One of the limits on data mining in SQL Server 2000 was that it had only two algorithms—a small number relative to other data mining tools. Microsoft added seven more algorithms in Yukon, including regression trees, sequence clustering, association rules and time series. It also included a capability called text mining, a tool for finding trends in unstructured data such as e-mails and documents.
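Association rules, one of the algorithm families named above, can be illustrated with a small market-basket sketch. The transactions and thresholds below are invented for the example; Microsoft's own association rules implementation is, of course, far more sophisticated.

```python
# Minimal association-rule sketch: for each item pair, compute support
# (how often the pair occurs together) and confidence (how often B appears
# given A). The transaction data is invented for illustration.
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

def rules(transactions, min_support=0.5, min_confidence=0.6):
    n = len(transactions)
    items = set().union(*transactions)
    found = []
    for a, b in combinations(sorted(items), 2):
        both = sum(1 for t in transactions if a in t and b in t)
        has_a = sum(1 for t in transactions if a in t)
        support = both / n
        confidence = both / has_a if has_a else 0.0
        if support >= min_support and confidence >= min_confidence:
            found.append((a, b, support, confidence))
    return found

for a, b, s, c in rules(transactions):
    print(f"{a} -> {b}: support={s:.2f}, confidence={c:.2f}")
```

Real engines extend this idea to larger itemsets and prune the search aggressively, but support and confidence remain the two measures a rule must clear.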

Microsoft isn't playing up the new algorithms much. Data mining users get the most benefit from decision trees and clustering algorithms that already existed in SQL Server 2000, MacLennan says: "I would say the algorithms are the smallest part of it."

Instead, Microsoft focused its efforts on areas where the company has often succeeded in the past: integration with developer tools, ease of use for end users and partner opportunities.

The database and developer teams worked closely to make it easy for developers to deploy a data mining model. "I can build a model, and I can put it into production with four lines of code. It's trivial," MacLennan says. "Or you can take [SQL Server] Reporting Services, and put that on top of your models, or [SQL Server] Integration Services. You can take this high level work and start realizing ROI much quicker."

Starting in Yukon, third-party algorithms will be able to plug in to the database at the same low level as Microsoft's own algorithms. That's a change from SQL Server 2000, when vendors attached their algorithms to the database through an abstraction layer. The new approach should result in faster performance and better scalability.

Still, Wayne Eckerson, director of research with TDWI (a sister organization to Redmond magazine), sees stumbling blocks to data mining becoming widely used. "The bottom line with data mining is that creating models and scoring records is not for the masses. It's for very specialized people with statistical skills. However, the output of what those folks do can be generally applied," Eckerson says.

Other vendors, such as NCR with its Teradata database, are also investing in making the data-modeling process more seamless and more scalable, Eckerson says.

But Microsoft does have strength in its ability to integrate with developer tools to make it fast and easy to port data models into real applications. "That's probably where Microsoft is spending more of its time," Eckerson says.

About the Author

Scott Bekker is editor in chief of Redmond Channel Partner magazine.

