Amazon Redshift and Google BigQuery are the Coke and Pepsi of data warehouses: two comparable fully managed petabyte-scale cloud data warehouses. They’re similar in many ways, but anyone who’s comparing cloud data warehouses should consider how their unique features can contribute to an organization’s data analytics infrastructure.
Amazon Redshift overview
As defined by Amazon, “Amazon Redshift is a fast, fully managed data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing Business Intelligence (BI) tools. It allows you to run complex analytic queries against petabytes of structured data, using sophisticated query optimization, columnar storage on high-performance local disks, and massively parallel query execution.”
Redshift is based on technology Amazon licensed from ParAccel in 2012; the ParAccel Analytic Database was, in turn, loosely based on PostgreSQL.
More than 10,000 companies use Redshift, according to HG Insights. Customers include well-known brands such as McDonald’s, Pfizer, and Lyft.
Google BigQuery overview
“BigQuery is a serverless, highly-scalable, and cost-effective cloud data warehouse with an in-memory BI Engine and machine learning built in,” according to Google.
BigQuery is an externalized version of an internal tool, Dremel, a query system for analysis of read-only nested data that Google developed in 2006. The company released BigQuery in 2012 to provide a core set of features available in Dremel to third-party developers. Dremel uses SQL-like queries, while BigQuery uses ANSI-compliant SQL.
More than 3,000 companies use BigQuery, including The New York Times, HSBC Bank, and Spotify, according to HG Insights.
How to benchmark data warehouse performance
Performance is a key differentiating feature for any data warehouse. Benchmarking data warehouse performance typically begins with two questions:
- What types of queries will we run?
- What’s the appropriate test environment?
The questions sound simple, but getting useful answers can be a complex process.
In most organizations, multiple business units query the data warehouse. Ideally, evaluators would define and test the types of queries likely to be run by all business units. This may be unrealistic, however, so some test queries may serve as a representative sample of a larger set.
The test environment should mirror the production environment. Both Redshift and BigQuery offer free trial periods during which customers can evaluate performance, but they impose limits on available resources during trials.
Redshift provides 750 free hours per month for two months, enough for businesses to continuously run one dc2.large node with 160GB of compressed SSD storage.
BigQuery has two free tiers: one for storage (10GB) and one for analysis (1TB/month). Both are for new users only.
The free trial limits may not allow organizations to simulate a production environment, but the limited resources available during a trial can still be useful for comparative benchmarking.
It may also be helpful to look at third-party benchmark results, but keep in mind that a configuration that performed well in someone else's benchmark may not perform well in your environment. Benchmarking is most useful when you run it yourself; if you can't, look for published results from an environment that closely mirrors your own.
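Whatever queries you settle on, the mechanics of comparative benchmarking are simple: run each representative query several times against each warehouse and compare a robust statistic rather than a single run. A minimal sketch in Python, where `run_query` and the stand-in workload are placeholders for your own warehouse client code:

```python
import statistics
import time

def benchmark(run_query, runs=5):
    """Time a query function several times and return the median,
    which is less sensitive to outliers such as cache warm-up or
    network jitter than a single measurement."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Stand-in workload; in practice run_query would submit a
# representative SQL statement through the warehouse's client library.
median_s = benchmark(lambda: sum(i * i for i in range(100_000)))
```

Repeating the same harness against both warehouses, with identical queries and data volumes, gives a like-for-like comparison even within trial-tier resource limits.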
Features: Redshift vs. BigQuery
When comparing Redshift and BigQuery, the devil is in the details: the features, capabilities, and underlying architecture. Think about the following features when evaluating Redshift and BigQuery.
| Feature | Amazon Redshift | Google BigQuery |
|---|---|---|
| Platform | Amazon Web Services | Google Cloud Platform |
| Infrastructure | Provisioned clusters and nodes | Serverless |
| Availability (regions per area) | Americas: 8<br>Europe: 5<br>Asia Pacific: 9 | Americas: 4<br>Europe: 3<br>Asia Pacific: 7 |
| Updating tables | Staging table | Data Manipulation Language |
| Table column limit | 1,600 columns | 10,000 columns |
| Streaming data ingestion | Requires Amazon Kinesis Firehose | Supported natively |
| Manageability / usability | User configures infrastructure; periodic management tasks required | No configuration necessary |
| Security | Inherits security features of Amazon Web Services<br>Encryption of data must be enabled<br>AWS data loss prevention (DLP) service, Macie, does not support Redshift | Inherits security features of Google Cloud Platform<br>Encrypts data by default<br>Google Cloud DLP service supports BigQuery |
Amazon and Google take significantly different approaches to delivering their services. Amazon's approach is akin to infrastructure-as-a-service (IaaS) or platform-as-a-service (PaaS). Customers are responsible for some capacity planning and must provision compute and storage nodes on the platform.
Google’s approach is more like software-as-a-service (SaaS). BigQuery is serverless, with no need to allocate resources. You can sign up for the service, access the web UI or REST API, and start querying.
Both data warehouses support batch data ingestion in similar ways. Streaming data is trickier, and each service approaches it differently.
Amazon Kinesis collects, processes, and analyzes streaming data and includes the Kinesis Firehose feature to stream data into Redshift. While Kinesis Firehose is a separate service required to stream data into Redshift, BigQuery natively supports streaming data.
Columnar databases are architected to support fast analytics processing. However, that architecture makes them slow and inefficient when it comes to updates and inserts. Nevertheless, both Redshift and BigQuery can handle data modification, though they do it in different ways.
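The trade-off is easy to see in miniature: a column store keeps each column in its own array, so an aggregate scans only the data it needs, but updating a single row means touching every column array. A toy Python sketch of the two layouts:

```python
# Row store: each record is one tuple; scanning a single column
# still touches every field of every row.
rows = [(i, f"user{i}", i % 100) for i in range(1000)]
row_sum = sum(r[2] for r in rows)

# Column store: each column lives in its own array; an aggregate
# reads only the one column it needs, but inserting or updating a
# row must modify every one of these arrays.
columns = {
    "id": [r[0] for r in rows],
    "name": [r[1] for r in rows],
    "score": [r[2] for r in rows],
}
col_sum = sum(columns["score"])
assert row_sum == col_sum  # same answer, very different I/O pattern
```

On disk the difference is even starker: a columnar warehouse reads only the blocks for the referenced columns, which is why analytics is fast and row-level modification is comparatively expensive.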
Redshift can do upserts, but only via a convoluted process. Redshift doesn’t provide an UPSERT command to update a table. Users must load data into a staging table and then join the staging table with a target table for an UPDATE statement and an INSERT statement.
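The staging-table pattern is easier to see in miniature. This SQLite sketch mimics the merge: remove target rows that have replacements, then insert everything from staging. Redshift's documented merge recipes follow the same shape, though the exact SQL differs:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, val TEXT)")
cur.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, val TEXT)")
cur.executemany("INSERT INTO target VALUES (?, ?)", [(1, "old"), (2, "keep")])
cur.executemany("INSERT INTO staging VALUES (?, ?)", [(1, "new"), (3, "added")])

# Step 1: delete target rows that the staging table will replace.
cur.execute("DELETE FROM target WHERE id IN (SELECT id FROM staging)")
# Step 2: insert every staging row, completing the upsert.
cur.execute("INSERT INTO target SELECT * FROM staging")
conn.commit()

rows = sorted(cur.execute("SELECT id, val FROM target").fetchall())
# rows == [(1, "new"), (2, "keep"), (3, "added")]
```

Row 1 was updated, row 2 untouched, row 3 inserted: the effect of an UPSERT, achieved in two statements plus a staging load.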
BigQuery was designed as an append-only system. In 2017, Google released Data Manipulation Language (DML) to enable users to update, insert, and delete data.
Redshift allows administrators to define a maximum of 1,600 columns in a single table. In BigQuery, the limit is 10,000 columns.
Google provides a Cloud SDK that supports C#, Go, Java, Node.js, PHP, Python, and Ruby. It covers many services across GCP, including BigQuery. Google also supports a REST API to create, manage, share, and query data, as well as APIs for BigQuery Data Transfer Service and BigQuery Storage.
Manageability and usability
Redshift requires administrators to configure the environment by creating nodes and clusters. An administrator also must monitor the service and allocate more resources if needed. Redshift requires periodic management tasks, such as cleaning up or vacuuming tables to remove rows that have been marked for deletion.
BigQuery, by contrast, requires no configuration. Google handles the infrastructure, automatically provisioning the needed resources behind the scenes. One simply creates a GCP project and runs a query. As the data volume grows or queries become more complex, Google automatically scales in the background to meet current needs. BigQuery’s focus on usability may allow analysts to do more without requiring the skills of a database administrator.
Both Redshift and BigQuery architectures are organized into regions and zones. Physically separate zones within each region provide high availability (HA). Redshift requires more manual configuration than BigQuery to ensure high availability, but both provide effective resiliency.
Redshift has regions in 14 countries, while BigQuery has regions in 12. Both services have one or more zones within each region. In the context of high availability, a zone should be viewed as a single point of failure, so two nodes within a single zone do not provide HA. Best practice for HA is to have a node in each of two (or more) zones within a region.
Security
Data warehouses commonly use two security controls to protect your data: encryption and identity and access management (IAM). Encryption has to be explicitly enabled in Redshift, while BigQuery encrypts data at rest, in transit, and in use by default. Both services handle key management and offer an option for customer-managed keys.
AWS IAM and Google Cloud IAM provide role-based access controls for each platform’s suite of cloud services, including Redshift and BigQuery respectively. Even though the IAM services are not designed specifically for the data warehouses, they do provide a broad set of controls for those services. For example, one can grant a customized set of permissions for a single user to a specific cluster in Redshift or dataset in BigQuery.
Both platforms include virtual private cloud (VPC) and SSL connections.
Some platform security features are not supported by every AWS or GCP service. For example, the Amazon Macie data loss prevention (DLP) service doesn't support Redshift, while Google Cloud DLP does support BigQuery.
Each of these data warehouses leverages other services on the cloud platforms on which they run.
Redshift clusters run on Amazon Elastic Compute Cloud (EC2) instances. AWS parallel processing allows services to read and load data from multiple data files stored in Amazon Simple Storage Service (S3). Conversely, one can export data from Redshift to multiple data files on S3 and even extend queries to S3 without loading data into Redshift.
AWS Glue makes it easy to ETL data from S3 to Redshift. Redshift Spectrum lets users skip the ETL process in some cases by querying directly against data in S3. Amazon Kinesis collects, processes, and analyzes streaming data, while Kinesis Firehose streams data into Redshift. The AWS Database Migration Service simplifies the process of loading data from on-premises sources to Redshift.
If you use Google Cloud Platform services, you can capture source data in Cloud Bigtable, Cloud Datastore, or Cloud Spanner and ETL data into BigQuery using Cloud Dataflow.
Business intelligence tools
You can use either of these data warehouses with any of dozens of analytics and BI tools, but each cloud platform also offers its own analytics tool.
Amazon QuickSight integrates with Redshift and provides BI insights through interactive dashboards.
BigQuery has its own BI engine that provides subsecond query responses. It integrates with Google Data Studio to let users build interactive dashboards.
Pricing
Redshift and BigQuery offer significantly different pricing options, so there's no apples-to-apples comparison. Both have on-demand, pay-as-you-go pricing, as well as some form of flat-rate pricing.
Redshift charges for each node in a cluster. Node prices vary based on technical parameters and region. For example, a node in California can cost 50% more than a similarly configured node in Oregon. On-demand pricing starts at $0.25 per hour. Flat-rate pricing starts at $1,380 for one year or $2,465 for three years. Additional charges include Redshift Spectrum queries against data in S3, backup storage beyond the provisioned storage size, AWS Key Management Service encryption, and data transfers other than those between Redshift and S3.
BigQuery charges for queries, storage, and streaming inserts. Loading, copying, or exporting data is free. You can pay as you go for queries at $5 per TB or use flat-rate pricing starting at $10,000 per month. Storage is $0.02 per GB with the first 10GB free. You pay half that much for long-term storage if a table has not been edited for 90 days. Charges that can sneak up on you include Storage API usage, streaming data, Data Manipulation Language usage, and data definition language usage.
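Back-of-the-envelope arithmetic with the on-demand rates above makes the difference in pricing models concrete. The rates below are the ones cited in this article; check the current AWS and GCP pricing pages before budgeting:

```python
TB = 1024 ** 4  # BigQuery meters queries in binary terabytes (2**40 bytes)

def bigquery_query_cost(bytes_scanned, free_tb=1):
    """On-demand analysis: $5 per TB scanned, with the first
    1 TB per month free (rates as cited in this article)."""
    billable_tb = max(bytes_scanned / TB - free_tb, 0)
    return billable_tb * 5.0

def redshift_on_demand_cost(nodes, hours, price_per_node_hour=0.25):
    """Redshift on-demand: billed per node-hour; $0.25/hour is the
    entry-level rate cited in this article."""
    return nodes * hours * price_per_node_hour

# A month of 3 TB of queries vs. a 2-node cluster running 730 hours:
bq = bigquery_query_cost(3 * TB)      # (3 - 1) TB * $5 = $10.00
rs = redshift_on_demand_cost(2, 730)  # 2 * 730 * $0.25 = $365.00
```

The comparison cuts both ways: BigQuery's query-based billing is cheap for light, bursty workloads, while a heavily used Redshift cluster amortizes its fixed node cost across unlimited queries.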
Redshift or BigQuery: Which is better for you?
If Amazon Redshift and Google BigQuery are the Coke and Pepsi of data warehouses, then the one that wins the taste test should be the one that works best in your environment to meet your specific business needs.
Redshift and BigQuery have many similarities, but also important differences that can tip the scales. Are you heavily invested in AWS? Redshift may be a good choice for you. Are your database administrator resources limited? BigQuery may be a better choice. Do you have tables with more than 1,600 columns? Better you than me; go with BigQuery.
Whichever data warehouse you choose, chances are you’re going to fill it with data from sources outside of the AWS or GCP platform. When that time comes, you should leverage third-party tools like Stitch that offer ETL services for cloud data warehouses to save resources and reduce the time to business insights. Stitch offers a free trial, so you can test it with your own data sources and the data warehouse of your choice.