Going Big Data: What IT Managers Need to Know
The traditional SQL database is under siege by modern information stores designed for the cloud era that can process larger volumes of data more rapidly. Here's your guide to what's going on in the big data world and the major players to watch.
Unless you've been living under a rock, you know one of the biggest drivers of IT and cloud computing initiatives is the need to gather, process, store and analyze -- often in real time -- "big data."
Businesses and government agencies alike are stepping up initiatives in which they mine everything from their CRM systems and data feeds to tweets mentioning their organizations, looking for signals that can alert them to anything from a sudden problem with a product to a potential market opportunity spawned by an event. Online and big-box retailers are using big data to automate their supply chains on the fly. Law enforcement is analyzing huge amounts of data to thwart potential crime and terror attacks.
Big data drove an estimated $28 billion in IT spending last year, according to market researcher Gartner Inc. That figure will rise to $34 billion this year, Gartner estimates. In addition to pulling data in from social networks, a growing number of big data applications draw on machine data from sensors, telemetry systems and other non-human interfaces -- as well as large volumes of unstructured content -- to determine trends and deliver insights and intelligence in near-real time. Gartner noted that 10 percent of new IT spending on application infrastructure and middleware is in some way influenced by big data.
Many of the big data initiatives under way now are the result of the growing proliferation of Apache Hadoop-based data stores built on the Hadoop Distributed File System (HDFS). Hadoop can run alongside such analytics engines as Apache Hive, originally developed by Facebook before the company contributed it to the open source community.
Hadoop-based repositories let users store terabytes of unstructured information in massively distributed clusters based on commodity servers. Using a rapidly growing market of query tools from traditional suppliers and a slew of startups, users can find and access that content faster than ever before.
Recognizing this trend, the leading providers of relational database management systems (RDBMSes) -- Oracle Corp., IBM Corp., Microsoft and SAP AG -- are placing major bets on Hadoop to let customers manage and analyze huge volumes of data.
Microsoft Previews HDInsight
Microsoft took an important step forward in its quest to bring big data to the cloud in late March when it released the public preview of its Windows Azure HDInsight offering. The service, first made available on a limited basis last fall, aims to let enterprises process big data using Microsoft SQL Server and the Hortonworks Inc. distribution of the Hadoop file store, which the companies emphasize is 100 percent Apache-compatible. Spun out of Yahoo! Inc. with the help of Benchmark Capital in 2011, Hortonworks formed a partnership with Microsoft to enable SQL Server to use its Hadoop distribution.
The HDInsight Service in Windows Azure lets organizations spin up Hadoop clusters in Windows Azure in a matter of minutes, noted Eron Kelly, general manager for the Microsoft SQL Server group, in a March 20 blog post.
"These clusters are scaled to fit specific demands and integrate with simple Web-based tools and APIs to ensure customers can easily deploy, monitor and shut down their cloud-based cluster," Kelly noted. "In addition, [the] Windows Azure HDInsight Service integrates with our business intelligence tools including Excel, PowerPivot and Power View, allowing customers to easily analyze and interpret their data to garner valuable insights for their organization."
Among the first to test HDInsight is Ascribe Ltd., a U.K.-based Microsoft partner that provides health care management systems for hospitals and large medical practices. Its solution handles the lifecycle of patient care using key new components of the Microsoft portfolio -- including Windows 8-based tablets, SQL Server 2012 and HDInsight -- to perform trending analysis using anonymous patient data.
Paul Henderson, Ascribe head of business intelligence, demonstrated the application at the GigaOM Structure:Data conference in New York in March. "Rather than building our own server farm or incurring huge capital costs, HDInsight provides us with the ability to process that volume of stuff at scale, and that's a huge benefit," said Henderson in an interview.
Scores of players are now talking up new ways of capturing, analyzing and processing huge amounts of data. While the biggest alternatives to SQL Server have long been RDBMSes from Oracle, IBM, Teradata Corp. and, more recently, MySQL, there are now a vast number of players looking to offer modern alternatives to traditional SQL database stores.
EMC-VMware's Pivotal Move
One major entry emerged last month when VMware Inc. and its corporate parent EMC Corp. spun out a key portfolio of application infrastructure, big data and analytics assets to a new venture called Pivotal Inc., which is now headed by former VMware CEO Paul Maritz. Just as EMC saw the opportunity to create VMware as an independent entity, the company is taking a similar strategy with Pivotal, Maritz said at a presentation for investors in New York on March 13.
"There's a large market to go after here," Maritz said, noting the core assets brought into Pivotal from EMC and VMware, including its Greenplum analytics platform that's now Hadoop-based; the Cetas real-time analytics engine; GemFire, a cloud-based data management platform for high-speed transaction processing that's often used in trading applications; Cloud Foundry; the Java-based Spring Framework; and Pivotal Labs, the destination of many customers looking to take business problems from concept to a deliverable application.
Maritz believes Pivotal, now a $300 million business, can grow to $1 billion in revenues by 2017. EMC and VMware are arming the venture with a $400 million investment and technology that has been under development for several years with more than 100 engineers. "We're moving to where the puck is going," Maritz said.
The first key deliverable, Pivotal HD, surfaced last month. Based on the company's own Hadoop distribution, Pivotal HD aims to expand the capabilities of the store with HAWQ, a high-performance Hadoop-based RDBMS. It offers a Command Center to manage HDFS, HAWQ and MapReduce; an Integrated Configuration Management (ICM) tool to administer Hadoop clusters; and Spring Hadoop, which ties it into the company's Java-based Spring Framework. It also includes Spring Batch, which simplifies job management and execution.
Experts say Pivotal HD could put pressure on the leading Hadoop distributors Cloudera Inc., MapR Technologies Inc. and Hortonworks "because you have this very robust, proven MPP [massively parallel processing] SQL-compliant engine suddenly sitting on top of Hadoop and HDFS," says George Mathew, president and COO of Alteryx Inc., a San Mateo, Calif.-based provider of connectors that enable organizations to create dashboards for big data time analysis.
One of the reasons EMC and VMware are spinning out Pivotal is to give the company the freedom to focus on all cloud environments, including the widely used Amazon Web Services (AWS).
For its part, AWS recently launched its own cloud-based, data-warehousing platform called Redshift. Early indications are that many customers are considering Redshift because it offers a much lower cost of entry than incumbent data-warehouse providers, says Darren Cunningham, VP of cloud marketing at Informatica Corp., which itself recently released a connector that links Redshift to existing data stores.
NoSQL Enables Cloud Alternatives
Along with Redshift, a growing number of customers are using various NoSQL databases -- those that can store and process both SQL and unstructured data with much higher levels of availability than traditional ones -- which lend themselves to cloud deployments. "The majors, as we may call them -- Amazon, Google and Microsoft -- all have multiple plays going on in the cloud database world," noted Blue Badge Insights analyst Andrew Brust, during a cloud database panel at the Structure:Data conference.
Many customers are running their cloud-based apps on the open source MongoDB, the most popular purveyor of which is 10Gen Inc. A number of alternative highly available databases include those from Basho Technologies Inc., which offers the open source, distributed NoSQL database Riak (and Amazon Simple Storage Service [S3]-compatible Riak Cloud Storage [CS] platform); NuoDB Inc., which in January launched Starlings, based on what it describes as unique technology (and patents) that addresses the issue of scaling out while also supporting traditional SQL commands and reliable Atomicity, Consistency, Isolation, and Durability (ACID) transactions supporting both structured and non-SQL models; and DataStax, a leading distributor of the highly available database architecture based on Apache Cassandra.
Cassandra Targets High Availability
The appeal of Cassandra is that it's fully distributed. There's no reliance on a master replica that can go down or create a bottleneck. That allows for continuous availability, where every single node is available for full reads and writes directly. By not requiring a master, Cassandra can fail over much faster. "That means [users] can hit any node at any time, with response times of 50 ms or less," says DataStax CEO Billy Bosworth.
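To make the masterless idea concrete, here is a minimal Python sketch of a toy cluster in which any live node can coordinate a read or write and each key is copied to more than one node. It is an illustration only, not Cassandra's actual implementation: the `Node` and `Cluster` classes, the MD5-based ring placement and the replica count of two are all assumptions made for this example, and real systems add features this sketch omits (tunable consistency, repair of replicas that missed a write, and so on).

```python
import hashlib
from bisect import bisect_right


class Node:
    """One peer in the toy cluster (hypothetical; not a Cassandra class)."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}


class Cluster:
    """Toy masterless cluster: keys are placed on a hash ring of peers."""

    def __init__(self, names, replicas=2):
        self.nodes = {name: Node(name) for name in names}
        self.replicas = replicas
        # Order the nodes by the hash of their names to form the ring.
        self.ring = sorted((self._hash(name), name) for name in names)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def _owners(self, key):
        """Return the `replicas` nodes clockwise from the key's ring position."""
        start = bisect_right(self.ring, (self._hash(key), ""))
        return [
            self.nodes[self.ring[(start + i) % len(self.ring)][1]]
            for i in range(self.replicas)
        ]

    def _check_coordinator(self, coordinator):
        if not self.nodes[coordinator].up:
            raise ConnectionError(f"{coordinator} is down; contact another node")

    def write(self, coordinator, key, value):
        """Any live node may coordinate; the value goes to every live replica."""
        self._check_coordinator(coordinator)
        live = [node for node in self._owners(key) if node.up]
        for node in live:
            node.data[key] = value
        return len(live)  # number of replicas that stored the value

    def read(self, coordinator, key):
        """Any live node may coordinate; the first live replica answers."""
        self._check_coordinator(coordinator)
        for node in self._owners(key):
            if node.up and key in node.data:
                return node.data[key]
        return None


cluster = Cluster(["node-a", "node-b", "node-c"])
cluster.write("node-a", "cart:42", "3 items")

# Take one of the key's replicas offline; the remaining peers still answer.
cluster._owners("cart:42")[0].up = False
survivors = [name for name, node in cluster.nodes.items() if node.up]
answers = [cluster.read(name, "cart:42") for name in survivors]
```

Because no node holds a special role, losing one replica requires no election or promotion step: clients simply contact another peer, which is the property the paragraph above attributes to the design.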
"When people come to us it's because they want a database that's always available, meaning it's not tied to any master-replication strategy," Bosworth adds. "Users who come to us have an online application that they never want to think about being down." Does that mean Cassandra is the death knell for Oracle, IBM DB2 and Microsoft SQL Server, among others?
"I don't see the role of the relational database going away," Bosworth says. "It's a $16 billion market, [and those] don't just fall off cliffs -- but they will be used for a smaller percentage of the workload in the application architecture."
Incumbents vs. New Players
Indeed, despite the growing number of players and approaches, Blue Badge Insights' Brust believes many customers will look for the mainstream providers to embrace them. "We're seeing specialized products from specialized companies doing things that the major databases have glossed over," Brust said. "That's great, but when it's going to really become actionable for companies is when the mega-vendors either implement this stuff themselves or do some acquisitions and bring these capabilities into their mainstream databases that have the huge installed bases. Then it becomes a lot more approachable to enterprise companies."
Noted cloud and database analyst David Linthicum, also on the Structure:Data cloud database panel, was more skeptical. "It pushes them to be more innovative, but I haven't seen much innovativeness come out of these larger database organizations in the last couple of years," Linthicum said.
Microsoft's In-Memory Plan
Microsoft isn't ceding the market to upstarts. Its Windows Azure Table Storage Service is designed to support large volumes of data while offering more-efficient access and persistence. Microsoft is also addressing growing demand for in-memory databases, made popular last year by SAP with HANA. In-memory databases can perform queries much faster than databases that keep their data on disk. Microsoft revealed its plans to add in-memory capabilities, code-named "Hekaton," to the next release of SQL Server at the SQL PASS Summit back in November 2012.
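As a rough illustration of what it means to keep some tables in memory and others on disk within a single database session, the Python sketch below uses SQLite as a stand-in. This is not Hekaton and not SQL Server syntax: SQLite's `ATTACH DATABASE ':memory:'` simply adds a RAM-backed schema next to a file-backed one, and the table names, sample rows and the `join_disk_and_memory` helper are assumptions invented for the example.

```python
import os
import sqlite3
import tempfile


def join_disk_and_memory(db_path):
    """Join an on-disk table with an in-memory table in one query (SQLite)."""
    conn = sqlite3.connect(db_path)  # the "main" schema is stored in a file
    conn.execute("ATTACH DATABASE ':memory:' AS mem")  # this schema lives in RAM

    conn.execute("CREATE TABLE main.customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE mem.sessions (customer_id INTEGER, clicks INTEGER)")
    conn.executemany(
        "INSERT INTO main.customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")]
    )
    conn.executemany(
        "INSERT INTO mem.sessions VALUES (?, ?)", [(1, 5), (1, 7), (2, 3)]
    )

    # A single SQL statement spans both storage types.
    rows = conn.execute(
        "SELECT c.name, SUM(s.clicks) "
        "FROM main.customers AS c "
        "JOIN mem.sessions AS s ON s.customer_id = c.id "
        "GROUP BY c.name ORDER BY c.name"
    ).fetchall()
    conn.close()
    return rows


with tempfile.TemporaryDirectory() as tmp:
    result = join_disk_and_memory(os.path.join(tmp, "demo.db"))

print(result)  # [('Ada', 12), ('Grace', 3)]
```

The point of the sketch is only that the query author writes ordinary SQL and does not need to care where each table is stored.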
"This is a separate engine that's in the same product in a single database and will have tables optimized for either the conventional engine or the in-memory engine," Brust said of Hekaton. "You can join between them so you're going more toward an abstraction."
But with a growing number of players looking to offer new types of data repositories, Microsoft is now in a more crowded field. While Microsoft has broadened its data-management portfolio with SQL Azure and now HDInsight, the requirement to find, process and analyze new types of information is greater than ever. Looking forward, all eyes will be on Hekaton and Microsoft's ability to deliver new levels of performance to SQL Server.