Enterprises around the world are taking advantage of cloud data warehouses as a base for business intelligence and other data-driven initiatives. Having a cloud-based data warehouse eliminates the cost of maintaining internal hardware and infrastructure and yields performance and cost efficiency improvements due to the cloud's inherent scalability. Top cloud data warehouses include Amazon Redshift, Google BigQuery, Microsoft Azure SQL Data Warehouse, and Snowflake.
Choosing a provider for a cloud data warehouse can be difficult due to the large number of factors that can influence the decision. It pays to know the most popular cloud data warehouses' key features and the criteria to use when evaluating them.
While technical details and costs differ among them, the major cloud data warehouse providers share a number of features and core capabilities:
While these data warehouses have historically high uptime, they're not immune to outages or failures. Internet attacks and human error periodically cause downtime and render a warehouse unavailable. A data warehouse is a potential single point of failure in an organization's data-driven processes, so reliability is paramount. All of the major cloud data warehouse providers offer reliability features, such as data replication across data centers and regions.
A key feature of the cloud is the flexibility and scalability of storage and compute resources. Massive parallel processing (MPP) systems power the major cloud data warehouse services, allowing scaling up when compute demands are high and down to reduce costs when demands are lower.
All of the data in a cloud data warehouse is stored in columnar format rather than by row. This optimizes for compression and minimizes the number of disk seeks on queries.
Several major cloud data warehouse solutions can meet enterprises' needs for cloud computing and big data processing.
Redshift is a fast, scalable data warehouse service in the Amazon Web Services (AWS) ecosystem. With more than 15,000 customers using the service, Redshift has the largest number of deployments of the major providers.
Redshift can support big data and optimizes for speed and high levels of concurrency. It often wins benchmarks for performance and cost efficiency on standard data analytics workflows, although third-party benchmark tests are no guarantee of superior performance in your enterprise's environment.
BigQuery is the Google Cloud Platform offering for storing, processing, and analyzing large datasets. With an emphasis on seamless scalability, BigQuery can allocate tens of thousands of cores to a deployment in seconds. Unlike RedShift, BigQuery is serverless and thus provides services dynamically, requiring no upfront provisioning or managing of hardware.
BigQuery also maintains a high level of simplicity and abstraction. The service exposes a SQL interface while automating many complex back-end operations such as scaling of compute resources and data replication across availability zones. It provides encryption of data at rest and in transit by default and requires no configuration to get running.
While Microsoft may be more known for their SQL Server database, they offer a competitive cloud data warehouse platform. Azure SQL Data Warehouse offers a scalable and cost-effective system for processing large datasets with familiar syntax and interfaces.
Azure's data warehouse provides more control over indexing than its competitors. While using this feature requires more technical knowledge to optimize the underlying data structures, it allows IT departments with strong SQL skills to better manage their systems.
Snowflake can be run on multiple cloud providers — Amazon Web Services, Microsoft Azure, and Google Cloud Platform. Snowflake provides flexibility in its pricing by charging for compute and storage separately and offering several sizes of data warehouse resources.
Snowflake shines in its plug-and-play ability. It automates many menial data maintenance tasks such as updating metadata and vacuuming, and it automatically enables a number of security features.
Though they share similarities, the major cloud data warehouses have significant differences that your enterprise needs to understand before deciding between them. Consider these five factors when evaluating which platform best fits your business needs:
While cloud data warehouses are built to be generalizable across different industries and business departments, you should carefully consider how you plan to use yours, as the factors for evaluating the providers can vary based on the use case and your enterprise’s unique circumstances.
For example, an enterprise that intends to work with JSON in its data warehouse may prefer Snowflake to Redshift, as Snowflake natively supports storing and querying JSON. Organizations that lack a dedicated data administrator may want to choose a warehouse other than Redshift, as it requires significant configuration and regular monitoring.
Your enterprise must choose a cloud data warehouse that supports the level and type of security it needs. While all of the major vendors keep their security systems up to date and patch vulnerabilities, the systems' configurations and defaults vary. For example, encryption is handled differently among the major cloud data warehouses. BigQuery encrypts data at rest and in transit by default, while Redshift requires database encryption to be explicitly enabled. Consider factors such as key management and access control as well.
Cloud data warehouses vary with regard to data requirements and assumptions. Snowflake, for instance, supports semistructured data in structures such as Object or Array, while Redshift does not. As a result, Redshift can make stronger assumptions about the structure of the data (such as patterns in the data type, which affects the choice of compression schemes) and better optimize its storage speed, but data retrieval speed could be slower due to Redshift’s lack of support for denormalized data structures.
The level of flexibility required by an enterprise should determine which approach fits best. For instance, if an organization has to store data where the structure is not necessarily predefined, Snowflake’s looser structure may be beneficial.
The major cloud data warehouses bundle resources and calculate costs in different ways. For instance, Redshift bundles storage and compute resources together. This leads to simple pricing, but it also forces users to accept predefined instance type values for memory, storage, and I/O.
BigQuery has a more granular pricing structure and charges for bytes read, storage, and streaming inserts. Unlike Redshift, it does not charge for hardware resources. Its costs are therefore less predictable, as they're primarily a function of bytes read in queries, so it can be hard to predict usage accurately.
Azure's data warehouse bundles the lower-level technical factors of cost related to compute, such as logical CPU cores and I/O, into a “Data Warehouse Unit” (DWU). Cost calculations then become a function of storage and DWUs. Users can pause DWU usage, and charges then accrue only for storage.
Similarly, Snowflake also abstracts physical resources into credits, which increase proportionally with the number of virtual warehouses and the amount of resources within each. A virtual warehouse is a cluster of machines that execute queries, load data, and perform other data manipulation operations. Storage is separate and billed per terabyte per month.
Most cloud data warehouse services also provide flat-rate pricing. For instance, Redshift has a pricing model called Reserved Instances that provides discounts if an organization commits to and pays for resource usage for a year or more. Reserved Instances allow businesses with large deployments to manage their costs as usage becomes more predictable.
While it's difficult to estimate costs with much accuracy before you begin using a data warehouse, an analysis of expected workflows can help you get close. Ask relevant questions, such as:
These questions can help you calculate expected utilization so that you can compare the providers and make an informed decision.
The major vendors all provide free trials. Consider using these to determine a rough estimate of what costs will look like at scale.
An organization should also take into account where existing data and applications reside. For example, if the majority of data in a system is already in S3, using Redshift or Snowflake on AWS could lead to performance gains due to physical data locality. Even if you have to sync data from compute and storage resources in different availability zones, staying within AWS ensures the data transfer paths are on highly optimized infrastructure rather than having to traverse the public Internet.
However, any difference in performance may only be marginal, and a latency difference of tens of milliseconds on data import will rarely have a material impact on real-world workflows.
Choosing the best cloud data warehouse for your business can be overwhelming, as many variables can impact the successful deployment of a system. Despite this, by considering expected use cases and workflows, an enterprise can evaluate the relevant factors and select the warehouse that best fits its needs.
Stitch can help your organization move your data simply and quickly to your preferred cloud data warehouse, whether that's Redshift, BigQuery, Azure, or Snowflake. Unlock the advantages of cloud data warehouses with Stitch today.