Setting the Data Strategy for Your Growing Organization

CHAPTER 2

What technology should we use to store and analyze our data?

Once you have a data pipeline, you need to figure out where you’re going to store and analyze all that data. You need a data warehouse. A data warehouse is a central repository of integrated data from multiple disparate sources used for reporting and analysis.

Your data warehouse will become the single source of truth for your organization. All organizational data will be loaded into this warehouse and it will power all reporting and analysis. As such, it’s incredibly important to choose the right technology to use as your foundation.

What you should be looking for in a data warehouse

  1. Performance. Your warehouse needs to be able to ingest and analyze enormous quantities of data extremely quickly.

  2. Scalability. As you grow, more data will be piped in and more users will need to run analyses. Your warehouse should be able to keep pace with your growth.

  3. Compatibility. SQL is the most widely used query interface, with a massive ecosystem of both users and tools. SQL compatibility should be a top priority for any warehouse technology.

  4. Analytic functionality. Analysts often need to perform more complicated calculations than are supported in traditional SQL syntax, including regressions and predictive analytics.

An introduction to analytic databases

Data warehouse technology has advanced significantly in just the past few years. An entire category called analytic databases has arisen to specifically address the needs of organizations that want to build very high-performance data warehouses. Analytic databases are purpose-built to analyze extremely large volumes of data very quickly, and often perform 100-1,000 times faster than transactional databases on these tasks.


It wasn’t until Vertica’s founding in 2004 that the modern analytic database came into being. Since then, the market has exploded. Venture-backed technology companies have led the way, including Vertica, ParAccel, and Greenplum, alongside longtime incumbent Teradata. Most of these startups have since been acquired by leading enterprise technology companies, notably HP and EMC.

This history lesson is important for a couple of reasons. First, this technology is still new and not yet widely adopted. If this is the first time you’re hearing about some of these analytic databases, you’re not alone. Second, these companies have been very focused on selling to large enterprises, making it difficult for growing companies to access this technology.

The advent of the cloud analytic database

For growing companies, cloud computing provides an extremely attractive way to purchase and deploy technology solutions. With no fixed up-front costs and an ability to spin up additional resources as demand grows, cloud computing naturally fits the needs of these cash-sensitive, quickly-scaling businesses. The use of analytic databases within these companies didn’t take off until 2012, when Amazon released Redshift.

From a performance standpoint, Redshift is head-to-head competitive with the other analytic databases in the market. It can be purchased for as little as $100 or so per month and is provisioned completely online. This allows companies to bypass the complex process of purchasing and configuring their own hardware to get up and running, and has allowed Redshift to gain prominence in the market extremely quickly. Amazon unofficially reports that it is the fastest-growing service ever released as a part of AWS.

Today, Amazon Redshift is the dominant player in cloud analytic databases, but other startups are attacking the space as well. Snowflake and BitYota are two prominent competitors, although both are still in the early phases of their growth. We anticipate that the cloud delivery model for analytic databases will become more prevalent and diverse as the market matures.

What’s under the hood of an analytic database

How exactly do analytic databases deliver this 100-1,000x performance improvement for analytical query processing? Without getting too technical, there are a few key traits that allow analytic databases to perform their stunning feats of speed.

Columnar data storage. An analytic database uses column-based storage, where each column of data is stored in its own file, so all of the values for a given column sit together on disk. (Your tables are still modeled in familiar star or snowflake schemas; only the physical layout differs.) This design makes it possible to operate on a large set of data points within a given column very quickly. Transactional databases rely instead on row-based storage. They’re great for operating quickly on a single row, but a row-based design can’t scale to handle large analytical scans the way a columnar design can.
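To make the difference concrete, here is a minimal Python sketch (illustrative only: the table and column names are invented, and no real storage engine operates on Python lists):

```python
# The same three records laid out row-wise vs. column-wise.
rows = [
    {"order_id": 1, "region": "east", "amount": 120.0},
    {"order_id": 2, "region": "west", "amount": 75.5},
    {"order_id": 3, "region": "east", "amount": 200.0},
]

# Row-based storage: summing one column still touches every field
# of every record, because whole rows are stored together.
row_total = sum(r["amount"] for r in rows)

# Columnar storage: each column lives in its own contiguous structure,
# so an aggregation scans only the values it actually needs.
columns = {
    "order_id": [1, 2, 3],
    "region": ["east", "west", "east"],
    "amount": [120.0, 75.5, 200.0],
}
col_total = sum(columns["amount"])

print(row_total, col_total)  # 395.5 395.5
```

The answer is the same either way; what differs is how much data must be read to get it, which is where the 100-1,000x gap comes from at scale.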

Efficient data compression. As a direct result of its columnar data storage design, an analytic database is able to execute the most efficient form of data compression for each particular column of data. Data compression is the most important determinant of how much space your data will take up, and how quickly you’ll be able to move it around. Analytic databases excel at this.
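Run-length encoding is one common columnar compression technique: when a column’s values are sorted and low-cardinality, long runs of repeated values collapse to almost nothing. A minimal Python sketch (the region values are invented):

```python
from itertools import groupby

def rle_encode(values):
    """Run-length encode a sequence into (value, run_length) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]

# A sorted, low-cardinality column compresses dramatically:
region_column = ["east"] * 500 + ["north"] * 200 + ["west"] * 300
encoded = rle_encode(region_column)

print(encoded)  # [('east', 500), ('north', 200), ('west', 300)]
print(len(region_column), "values ->", len(encoded), "runs")
```

This only works well because all of a column’s values are stored together; interleave them row by row with order IDs and amounts, and the runs disappear.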

Distributed workloads. In a distributed system, your data is stored on a cluster of servers (often referred to as “nodes”). For example, in a Redshift warehouse, your data might be stored across 4-18 different servers, and Redshift coordinates the parallel processing of your analytical queries across all of them. This parallelization allows for efficient processing of increasingly large volumes of data. Analytic databases are typically built with distributed processing at their core.
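The scatter-gather pattern behind this can be sketched in Python, with threads standing in for cluster nodes (a simulation of the idea, not how any real warehouse is implemented):

```python
from concurrent.futures import ThreadPoolExecutor

def partition(values, n_nodes):
    """Distribute values across n_nodes slices, round-robin."""
    return [values[i::n_nodes] for i in range(n_nodes)]

def node_sum(node_slice):
    # Each "node" computes a partial aggregate over only its own slice.
    return sum(node_slice)

amounts = list(range(1, 1001))   # 1 + 2 + ... + 1000 = 500500
slices = partition(amounts, 4)   # e.g. a 4-node cluster

# A leader scatters the work, then gathers and combines partial results.
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(node_sum, slices))

print(total)  # 500500
```

Because each partial sum depends only on its own slice, adding nodes divides the per-node work, which is why these systems scale horizontally.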

In-warehouse math

Many databases can perform basic arithmetic. If you want to add, subtract, multiply, or divide, you’re not going to have any problems regardless of your database platform. And most transactional databases will go further, supporting basic statistical functions like variance and standard deviation. But that’s when you hit the end of the road for the math capabilities of a transactional database.
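For illustration, the variance and standard deviation that mark this “end of the road” can be sketched with Python’s standard statistics module (the amounts are invented sample data):

```python
import statistics

amounts = [120.0, 75.5, 200.0, 150.0]

# Population variance and standard deviation: roughly the ceiling of
# built-in math in a typical transactional database.
var = statistics.pvariance(amounts)
std = statistics.pstdev(amounts)

print(var)  # 2051.921875
print(std)  # ~45.3
```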

Historically, if you wanted to perform more sophisticated mathematical analysis, you had to export raw data from your database and import it into a statistical tool like Stata, SAS, or R for analysis. But in today’s world of huge datasets, that model breaks down. It’s challenging, if not impossible, to export terabytes of data to be analyzed in these kinds of statistical tools. Analytic databases are rushing to address this problem, enabling users to perform ever more sophisticated statistical and mathematical analysis within the database itself.

Examples of this include basic trigonometric, logarithmic, and exponential functions, statistical functions like covariance and linear regression, and geospatial functions to determine the distance between two coordinates. Some platforms also provide analysts the ability to write their own functions in a low-level language. These functions can then be run natively within the analytic database, with all of the attendant advantages of scalability and performance.
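As a sketch of what an in-warehouse linear regression actually computes, here is ordinary least squares written out in plain Python (the ad-spend and revenue figures are invented, and chosen to fit the line y = 5x + 5 exactly):

```python
def linear_regression(xs, ys):
    """Ordinary least squares fit for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Covariance of x and y, and variance of x (population forms).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    slope = cov / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

ad_spend = [10, 20, 30, 40]
revenue = [55, 105, 155, 205]

print(linear_regression(ad_spend, revenue))  # (5.0, 5.0)
```

An analytic database runs this same arithmetic as a built-in aggregate over billions of rows, without the data ever leaving the warehouse.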

Sophisticated mathematical functions delivered inside a high-performance analytic database are still in their infancy. We believe this functionality will become more impactful as analysts and analysis tools find innovative ways to use it. Whether or not you see an obvious need for this today, we believe you will in the near future. The applications are just too significant: performing statistical tests, calculating linear distances, and running regressions all have incredibly powerful applications in business analytics.

What about Hadoop?

We’ve talked extensively about the benefits of analytic databases for analyzing large datasets, but there’s an elephant in the room. How does Hadoop fit into this picture?

Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers. It has two primary components: HDFS, a distributed file system, and MapReduce, a system for parallel processing of large datasets in HDFS.
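A toy sketch of the MapReduce model in Python, using word counting, the canonical example (the “splits” stand in for HDFS blocks; real Hadoop distributes these phases across a cluster):

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one input split.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # Shuffle + reduce: group the pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

splits = ["big data big insight", "big insight"]
all_pairs = [pair for doc in splits for pair in map_phase(doc)]

print(reduce_phase(all_pairs))  # {'big': 3, 'data': 1, 'insight': 2}
```

Each split can be mapped independently on whichever node holds it, which is what makes the model scale, and also why it suits batch algorithms better than interactive SQL analytics.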

Hadoop is a critical component of the data stack for many organizations, but it’s not an appropriate choice of technology for your organization’s data warehouse. Its use cases tend to be more heavily statistical and algorithmic, and more focused on data science than business analytics. Here are a few examples of where Hadoop might be a better choice than an analytic database:

  • Recommendations. Recommendation algorithms, from movies to music to ecommerce, are commonly powered by algorithms running on top of Hadoop.

  • Classifications. From facial recognition to recognizing what song is playing on the radio (a la Shazam), classification algorithms are a popular usage of Hadoop.

  • Search. Hadoop’s precursors were actually systems built at Google to power its web search infrastructure; Hadoop is an open-source implementation of those designs, and it is still used in many search applications today.

These are just a few examples, and they’re all incredibly important applications of data analysis. What’s most important in this list of examples, though, are the things that aren’t included. Hadoop is not commonly used for the types of tasks we’re primarily interested in—analysis and visualization of data about businesses and their customers. For those tasks, analytic databases are not only more usable because of their SQL-based interface, they’re actually far higher performance.

If processing data in Hadoop is a priority for your organization, it will be important that your data pipeline output data to both a data warehouse and to HDFS, as depicted below. If you’re just getting off the ground, we’d recommend steering clear of Hadoop until you find a clear and compelling need. Until then, Hadoop will just be a distraction.

How Hadoop fits into your data pipeline

Your clear choice

Often, we will recommend that there are many potential solutions, and that based on your organizational needs, you should choose the most appropriate tool for your situation. That is not the case here. We believe that all companies, large or small, should be relying on an analytic database for their data warehouse. The benefits are just too large:

  • Dramatically better analytical query performance than transactional databases
  • Horizontal scalability
  • SQL compatibility
  • Advanced math and statistical functionality

And with the advent of cloud-based analytic databases, every company, regardless of budget, can afford access to this technology. We recommend that companies focus on whether they’d prefer to manage on-premises hardware and software or run their entire infrastructure in the cloud; your choice of platform will flow from that decision. Ultimately, the differences between Vertica, Redshift, Greenplum, and others are not that significant for most use cases. What’s most important is that a) you have a data warehouse, and b) you choose an analytic, not a transactional, database technology to power it.

