Amazon Redshift delivers lightning-fast performance and scalable data processing solutions without a massive investment in infrastructure. Redshift also provides access to a variety of data analytics tools, compliance features, and even artificial intelligence and machine learning applications.
But you already knew that.
What you might not know is that Redshift differs from traditional data warehouses in several critical areas. In this article, we’ll take a look at six features that set Redshift apart from other cloud data warehouses, and how these features can help you make the most of your Redshift data.
Before we get to that, let’s consider what makes Redshift and conventional data warehouses such different animals.
Redshift vs. traditional data warehouses
Data warehouses provide the storage and analytics capacity needed to drive business intelligence. In turn, the business insights gleaned from the data in our warehouses help us optimize our operations, grow revenue, and improve our marketing strategies.
Despite the benefits they promise to businesses, traditional data warehouses require massive investments of time, money, and expertise. On-premises servers must be built, managed, and maintained, and many companies find this scenario to be inefficient, wasteful, and expensive.
Redshift helps companies overcome this obstacle by providing a cloud-based suite of data management, processing, and analytics tools. In this scenario, Redshift provides a complete data warehouse infrastructure. Companies then rent the data processing and compute resources they need from Amazon.
Businesses capture two primary benefits from this model. First, they avoid the expense of building and maintaining a local infrastructure. Second, Redshift services can be scaled to meet demand, so companies only pay for the capacity they need at a given point in time.
As a result, Redshift provides a degree of agility and efficiency not possible with other types of data warehouses or infrastructures.
6 essential features of Redshift
Redshift is known for its emphasis on continuous innovation, but it’s the platform’s architecture that has made it one of the most powerful cloud data warehouse solutions. Here are the six features of that architecture that help Redshift stand out from other data warehouses.
1. Column-oriented databases
Data can be organized either into rows or columns. Which method works best depends on the nature of the workload.
The most common system of organizing data is by row. That’s because row-oriented systems are designed to quickly process a large number of small operations. This is known as online transaction processing, or OLTP, and is used by most operational databases.
In contrast, column-oriented databases allow for increased speed when it comes to accessing large amounts of data. For example, in an online analytical processing—or OLAP—environment such as Redshift, users generally apply a smaller number of queries to much larger datasets. In this scenario, being a column-oriented database allows Redshift to complete massive data processing jobs quickly.
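To make the row-versus-column trade-off concrete, here's a toy sketch in pure Python (not Redshift internals — just an illustration of why an analytical query over one column is cheaper in a columnar layout):

```python
# Toy illustration: the same table stored row-wise vs. column-wise.
# An analytical query ("average price") touches only one column; a
# columnar layout lets us scan just that list instead of every full row.

rows = [
    {"order_id": 1, "customer": "A", "price": 10.0},
    {"order_id": 2, "customer": "B", "price": 20.0},
    {"order_id": 3, "customer": "A", "price": 30.0},
]

# Row-oriented scan: every record is visited in full.
avg_row_store = sum(r["price"] for r in rows) / len(rows)

# Column-oriented layout: each column is a contiguous list.
columns = {
    "order_id": [1, 2, 3],
    "customer": ["A", "B", "A"],
    "price": [10.0, 20.0, 30.0],
}

# Columnar scan: only the "price" column is read.
avg_col_store = sum(columns["price"]) / len(columns["price"])

assert avg_row_store == avg_col_store == 20.0
```

Both scans return the same answer, but the columnar scan never touches `order_id` or `customer` — which is exactly the saving Redshift exploits at scale.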
2. Massively parallel processing (MPP)
MPP is a distributed design approach in which several processors apply a “divide and conquer” strategy to large data jobs. A large processing job is organized into smaller jobs which are then distributed among a cluster of processors (compute nodes). The processors complete their computations simultaneously rather than sequentially. The result is a large reduction in the amount of time Redshift needs to complete a single, massive job.
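The divide-and-conquer pattern can be sketched in a few lines of Python. Worker threads stand in for compute nodes here (real MPP distributes the chunks across separate machines), and the coordinating function plays the role of Redshift's leader node:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    """One compute node's share of the work."""
    return sum(chunk)

def mpp_sum(data, n_nodes=4):
    """Toy MPP coordinator: split the job, fan it out, merge results."""
    chunk_size = -(-len(data) // n_nodes)  # ceiling division
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(partial_sum, chunks))  # nodes run concurrently
    return sum(partials)  # the "leader node" merges the partial results

assert mpp_sum(list(range(1, 101))) == 5050
```

The key idea is that each partial result is computed independently, so the elapsed time is driven by the largest chunk rather than the whole dataset.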
3. End-to-end data encryption
No business or organization is exempt from data privacy and security regulations, and encryption is one of the pillars of data protection. This is especially true when it comes to complying with laws such as GDPR, HIPAA, the Sarbanes-Oxley Act, and the California Consumer Privacy Act (CCPA).
Encryption options in Redshift are robust and highly customizable. This flexibility allows users to configure an encryption standard that best fits their needs. Redshift security encryption features include:
- The option of employing either an AWS-managed or a customer-managed key
- Migrating data between encrypted and unencrypted clusters
- A choice between AWS Key Management Service (KMS) and a hardware security module (HSM)
- Options to apply single or double encryption, depending on the scenario
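As a sketch of how these options surface in practice, the helper below assembles the keyword arguments for boto3's Redshift `create_cluster` call with encryption enabled. The helper function and its defaults are our own invention; the parameter names (`Encrypted`, `KmsKeyId`) come from the AWS API:

```python
def encrypted_cluster_params(cluster_id, node_type, master_user,
                             master_password, kms_key_id=None):
    """Build create_cluster kwargs for an encrypted Redshift cluster."""
    params = {
        "ClusterIdentifier": cluster_id,
        "NodeType": node_type,
        "MasterUsername": master_user,
        "MasterUserPassword": master_password,
        "Encrypted": True,  # encrypt data at rest
    }
    if kms_key_id is not None:
        # Customer-managed KMS key; omit to use the AWS-managed default key.
        params["KmsKeyId"] = kms_key_id
    return params

# A real call would then look like:
#   boto3.client("redshift").create_cluster(**encrypted_cluster_params(...))
```

Separating parameter construction from the API call like this also makes the encryption configuration easy to review and test without touching AWS.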
4. Network isolation
For businesses that want additional security, administrators can choose to isolate their network within Redshift. In this scenario, network access to an organization’s cluster(s) is restricted by enabling Amazon Virtual Private Cloud (VPC). The user’s data warehouse remains connected to the existing IT infrastructure through an IPsec VPN.
5. Fault tolerance
Fault tolerance refers to the ability of a system to continue functioning even when some components fail. When it comes to data warehousing, fault tolerance determines whether a job can keep running when some processors or clusters go offline.
Data accessibility and data warehouse reliability are paramount for any user. AWS monitors its clusters around the clock. When drives, nodes, or clusters fail, Redshift automatically re-replicates data and shifts data to healthy nodes.
6. Concurrency limits
Concurrency limits determine the maximum number of nodes or clusters that a user can provision at any given time. These limits ensure that adequate compute resources are available to all users. In this sense, concurrency limits democratize the data warehouse.
Redshift maintains concurrency limits that are similar to other data warehouses, but with a degree of flexibility. For example, the number of nodes that are available per cluster is determined by the cluster’s node type. Redshift also configures limits based on regions, rather than applying a single limit to all users. Finally, in some situations, users may submit a limit increase request.
ETL with Redshift
Now that we’ve covered the distinguishing features of Amazon Redshift, let’s talk about ETL options. To get value from any warehouse, you need to carefully consider how the data will move in and out of the repository. People throughout your organization will be loading and reading data, so you’ll need a well-planned extract, transform, and load (ETL) process.
Many companies have relied on the conventional “lift-and-shift” approach to move their data from one platform to another. This is an in-house solution that requires time, expertise, and hand-coding. Here’s a typical example of this approach to ETL:
Custom-built ETL workflow
In a traditional ETL process, data is transferred and processed in batches from source databases to the data warehouse. Building ETL workflows with batch processing typically involves six steps:
- Reference data: The user creates a set of data that defines all permissible values.
- Data extraction: The success of your subsequent ETL steps depends on whether you extract the data correctly. Most ETL systems combine data from multiple source systems, so successful extraction requires converting data into a single, standardized format for processing.
- Data validation: In this step, the user confirms that data pulled from sources contains the expected values. For example, in a database of customer transactions from the past year, the date field should contain valid dates only within the past 12 months. Any data that fails the validation rule should be rejected. Errors discovered during this process are identified, analyzed, and resolved.
- Data transformation: Data is converted from the source format to that of its future destination.
- Data staging: Transformed data isn’t usually loaded directly into the target data warehouse. Data should first enter a staging database so that it can be rolled back in case errors arise. This is also a good time to generate audit reports for regulatory compliance or to diagnose and repair data problems before moving the data to the data warehouse.
- Publish to the data warehouse: In this step, the data is loaded to the target tables. Depending on the type of data warehouse you have, it may overwrite existing information every time the ETL pipeline loads a new batch. In other cases, ETL can add new data without overwriting — a timestamp indicates the data is new.
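The six steps above can be sketched as a minimal batch pipeline. This is pure Python with lists standing in for the source database, staging area, and warehouse; all names, formats, and validation rules are illustrative:

```python
from datetime import date, timedelta

VALID_CUSTOMERS = {"A", "B", "C"}            # step 1: reference data

def extract(source_rows):
    """Step 2: pull rows from a source into a standard dict format."""
    return [dict(r) for r in source_rows]

def validate(rows, max_age_days=365):
    """Step 3: keep rows with known customers and dates in the past year."""
    cutoff = date.today() - timedelta(days=max_age_days)
    good, rejected = [], []
    for r in rows:
        if r["customer"] in VALID_CUSTOMERS and r["date"] >= cutoff:
            good.append(r)
        else:
            rejected.append(r)               # errors to analyze and resolve
    return good, rejected

def transform(rows):
    """Step 4: convert to the destination format (cents, ISO dates)."""
    return [{"customer": r["customer"],
             "amount_cents": round(r["amount"] * 100),
             "date": r["date"].isoformat()} for r in rows]

def publish(staging, warehouse):
    """Step 6: load the staged batch into the warehouse tables."""
    warehouse.extend(staging)

source = [
    {"customer": "A", "amount": 9.99, "date": date.today()},
    {"customer": "Z", "amount": 1.00, "date": date.today()},  # fails validation
]
warehouse = []
good, rejected = validate(extract(source))
staging = transform(good)                    # step 5: staging area
publish(staging, warehouse)
assert len(warehouse) == 1 and len(rejected) == 1
```

Keeping the staging list separate from the warehouse is what makes the rollback and audit steps possible: nothing reaches the target tables until the batch has been validated and transformed.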
Building your own data pipeline can be costly in terms of time and engineering bandwidth. But it’s not the only option for moving data from your sources into Redshift. A more efficient solution can automate most of these steps to avoid the painstaking processes of hand-coding and manual extraction and loading.
The Stitch alternative
ETL is usually thought of as a three-step process. Data is extracted, transformed, and then loaded into its destination. While this approach may be useful for some businesses, it’s not always necessary or even preferred.
An alternative to traditional ETL allows users to move their data from source to destination quickly by allowing the destination data warehouse to manage the transformation process. It’s a variation known as ELT, which puts the emphasis on extract and load.
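The ELT pattern can be sketched with a few lines of SQL run inside the warehouse. Here sqlite3 stands in for Redshift; on Redshift the same pattern would typically use `COPY` to load raw data and `CREATE TABLE AS` to transform it in place:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: raw rows land in the warehouse untransformed.
conn.execute("CREATE TABLE raw_orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?)",
                 [("A", 10.0), ("B", 20.0), ("A", 30.0)])

# Transform: the warehouse itself aggregates the raw table.
conn.execute("""
    CREATE TABLE orders_by_customer AS
    SELECT customer, SUM(amount) AS total
    FROM raw_orders
    GROUP BY customer
""")

totals = dict(conn.execute(
    "SELECT customer, total FROM orders_by_customer ORDER BY customer"))
assert totals == {"A": 40.0, "B": 20.0}
```

Because the transformation runs where the data already lives, ELT avoids a separate transformation server and lets the warehouse's own parallelism do the heavy lifting.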
Redshift and Stitch — The fastest way to get your data to its destination
You’ve done your homework and now it’s time to replicate your data to Redshift. You can’t wait on a long, drawn-out ETL process: you need a tool that delivers speed and reliability. That’s where Stitch can help.
Stitch makes it easy to extract data from more than 90 sources and move it to Redshift in a flash. You’ll also have the Stitch support team on your side to help you deliver fast, dependable builds for your team.
Our approach is simple, straightforward, and ready to go right out of the box. No need to wait — set up a free trial in minutes and get your data to Redshift today.