What technology should we use to store and analyze our data?
Once you have a data pipeline, you need to figure out where you’re going to store and analyze all that data. You need a data warehouse. A data warehouse is a central repository of integrated data from multiple disparate sources used for reporting and analysis.
Your data warehouse will become the single source of truth for your organization. All organizational data will be loaded into this warehouse and it will power all reporting and analysis. As such, it’s incredibly important to choose the right technology to use as your foundation.
What you should be looking for in a data warehouse
- Performance. Your warehouse needs to able to ingest data and analyze enormous quantities of data extremely quickly.
- Scalability. As you grow, more data will be piped in and more users will need to run analyses. Your warehouse should be able to keep pace with your growth.
- Compatibility. SQL is the most widely-used query interface with a massive ecosystem of both users and tools. SQL compatibility should be considered a top priority for any warehouse technology.
- Analytic functionality. Analysts often need to perform more complicated calculations than are supported in traditional SQL syntax, including regressions and predictive analytics.
An introduction to analytic databases
Data warehouse technology has advanced significantly in just the past few years. An entire category called analytic databases has arisen to specifically address the needs of organizations who want to build very high-performance data warehouses. Analytic databases are purpose-built to analyze extremely large volumes of data very quickly and often perform 100-1,000 times faster than transactional databases in these tasks.
It wasn’t until 2004’s founding of Vertica that a modern analytic database came into being. Since then, the market has exploded. Venture-backed technology companies have led the way, including Vertica, ParAccel, Greenplum, Teradata and others. Most of these startups have been acquired by leading enterprise technology companies, notably HP and Oracle.
The advent of the cloud analytic database
The advent of cloud computing changed the analytic database landscape. Cloud computing is an attractive way to purchase and deploy technology solutions. With no fixed up-front costs and an ability to spin up additional resources as demand grows, cloud computing fits the needs of cash-sensitive, quickly-scaling businesses.
Amazon released Redshift, the first cloud analytic database, in 2012. It can be deployed for as little as $100 or so per month and is provisioned completely online, which allows companies to avoid capital expenditures and the complex process of installing, configuring, and maintaining their own hardware.
The cloud delivery model for analytic databases is now the default solution. Redshift is the dominant player in cloud analytic databases, but Snowflake, Google BigQuery, and Microsoft Azure Synapse are prominent competitors.
What’s under the hood of an analytic database
How exactly do analytic databases deliver this 100-1,000x performance improvement for analytical query processing? Without getting too technical, there are a few key traits that allow analytic databases to perform their stunning feats of speed.
Columnar data storage. An analytic database has a column-based structure, where each column of data is stored in its own file, and organized within star or snowflake schemas. This is a highly flexible design that facilitates operating on a large set of data points within a given column very quickly. Transactional databases rely instead on row-based data storage. They’re great for operating quickly on a single row, but a row-based design can’t scale to handle large volumes of data the way a columnar design can.
Efficient data compression. As a direct result of its columnar data storage design, an analytic database is able to execute the most efficient form of data compression for each particular column of data. Data compression is the most important determinant of how much space your data will take up, and how quickly you’ll be able to move it around. Analytic databases excel at this.
Distributed workloads. In a distributed system, your data is stored on a cluster of servers (often referred to as “nodes”). For example, in a RedShift warehouse, your data might be stored across 4-18 different servers, and Redshift coordinates the parallel processing of your analytical queries across all of them. This parallelization allows for efficient processing of increasingly large volumes of data. Analytic databases are typically built with distributed processing at their core.
Many databases can perform basic arithmetic. If you want to add, subtract, multiply, or divide, you’re not going to have any problems regardless of your database platform. And most transactional databases will go further, supporting basic statistical functions like variance and standard deviation. But that’s when you hit the end of the road for the math capabilities of a transactional database.
Historically, if you wanted to perform more sophisticated mathematical analysis, you had to export raw data from your database and import it into a statistical tool like STATA, SAS, or R for analysis. But in today’s world of huge datasets, that model breaks down. It’s challenging, if not impossible, to export terabytes of data to be analyzed in these kinds of statistical tools. Analytic databases are rushing to address this problem, enabling users to perform ever more sophisticated statistical and mathematical analysis within the database itself.
Examples of this include basic trigonometric, logarithmic, and exponential functions, statistical functions like covariance and linear regression, and geospatial functions to determine the distance between two coordinates. Some platforms also provide analysts the ability to write their own functions in a low-level language. These functions can then be run natively within the analytic database, with all of the attendant advantages of scalability and performance.
Performing statistical tests, calculating linear distances, and running regressions all have powerful applications in business analytics.
What about Hadoop?
We’ve talked extensively about the benefits of analytic databases for analyzing large datasets, but there’s an elephant in the room. How does Hadoop fit into this picture?
Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers. It has two primary components: HDFS, a distributed file system, and MapReduce, a system for parallel processing of large datasets in HDFS.
Hadoop is a critical component of the data stack for many organizations, but it’s not an appropriate choice of technology for your organization’s data warehouse. Its use cases tend to be more heavily statistical and algorithmic, and more focused on data science than business analytics. Here are a few examples of where Hadoop might be a better choice than an analytic database:
- Recommendations. Recommendation algorithms, from movies to music to ecommerce, are commonly powered by algorithms running on top of Hadoop.
- Classifications. From facial recognition to recognizing what song is playing on the radio (a la Shazam), classification algorithms are a popular usage of Hadoop.
- Search. The precursor of Hadoop was actually built at Google to power Google’s web search algorithm. Hadoop today is still used in many search applications.
These are just a few examples, and they’re all incredibly important applications of data analysis. What’s most important in this list of examples, though, are the things that aren’t included. Hadoop is not commonly used for the types of tasks we’re primarily interested in—analysis and visualization of data about businesses and their customers. For those tasks, analytic databases are not only more usable because of their SQL-based interface, they’re actually far higher performance.
If processing data in Hadoop is a priority for your organization, it will be important that your data pipeline output data to both a data warehouse and to HDFS, as depicted below. If you’re just getting off the ground, we’d recommend steering clear of Hadoop until you find a clear and compelling need. Until then, Hadoop will just be a distraction.
How Hadoop fits into your data pipeline
Your clear choice
Often an organization has many potential solutions to an engineering challenge. That's not the case here. All companies, large or small, should rely on an analytic database for their data warehouse. The benefits are just too big:
- Dramatically better analytical query performance than transactional databases
- Horizontal scalability
- SQL compatibility
- Advanced math and statistical functionality
And with the advent of cloud-based analytic databases, every company, regardless of budget, can afford access to this technology. We recommend that companies focus on whether they’d prefer to manage on-premise hardware and software, or leave their entire infrastructure in the cloud; your choice of platform will flow from that decision. Ultimately, the differences between Vertica, Redshift, Greenplum, and others are not that significant for most use cases. What’s most important is that a) you have a data warehouse, and b) that you choose an analytic, not a transactional, database technology to power it.