Amazon Redshift delivers lightning-fast performance and scalable data processing solutions without a massive investment in infrastructure. Redshift also provides access to a variety of data analytics tools, compliance features, and even artificial intelligence and machine learning applications.
But you already knew that.
What you might not know is that Redshift differs from traditional data warehouses in several critical areas. In this article, we'll take a look at six features that set Redshift apart from other cloud data warehouses, and how these features can help you make the most of your Redshift data.
Before we get to that, let's consider what makes Redshift and conventional data warehouses such different animals.
Data warehouses provide the storage and analytics capacity needed to drive business intelligence. In turn, the business insights gleaned from the data in our warehouses help us optimize our operations, grow revenue, and improve our marketing strategies.
Despite the benefits they promise to businesses, traditional data warehouses require massive investments of time, money, and expertise. On-premises servers must be built, managed, and maintained, and many companies find this scenario to be inefficient, wasteful, and expensive.
Redshift helps companies overcome this obstacle by providing a cloud-based suite of data management, processing, and analytics tools. In this scenario, Redshift provides a complete data warehouse infrastructure. Companies then rent the data processing and compute resources they need from Amazon.
Businesses capture two primary benefits from this model. First, they avoid the expense of building and maintaining a local infrastructure. Second, Redshift services can be scaled to meet demand, so companies only pay for the capacity they need at a given point in time.
As a result, Redshift provides a degree of agility and efficiency not possible with other types of data warehouses or infrastructures.
Redshift is known for its emphasis on continuous innovation, but it's the platform's architecture that has made it one of the most powerful cloud data warehouse solutions. Here are the six features of that architecture that help Redshift stand out from other data warehouses.
Data can be organized either into rows or columns, and the nature of the workload determines which method is the better fit.
The most common system of organizing data is by row. That's because row-oriented systems are designed to quickly process a large number of small operations. This is known as online transaction processing, or OLTP, and is used by most operational databases.
In contrast, column-oriented databases allow for increased speed when it comes to accessing large amounts of data. For example, in an online analytical processing—or OLAP—environment such as Redshift, users generally apply a smaller number of queries to much larger datasets. In this scenario, being a column-oriented database allows Redshift to complete massive data processing jobs quickly.
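To make the difference concrete, here's a minimal Python sketch of the same aggregate query against a row layout and a column layout. This is an illustration with made-up order data, not a depiction of Redshift's internals:

```python
# Row-oriented layout: each record is stored as one unit.
rows = [
    {"order_id": 1, "customer": "a", "amount": 120.0},
    {"order_id": 2, "customer": "b", "amount": 75.5},
    {"order_id": 3, "customer": "c", "amount": 42.0},
]

# Column-oriented layout: each column is stored contiguously.
columns = {
    "order_id": [1, 2, 3],
    "customer": ["a", "b", "c"],
    "amount": [120.0, 75.5, 42.0],
}

# An OLAP-style aggregate over a single column:
# the row layout must touch every field of every row...
total_from_rows = sum(row["amount"] for row in rows)

# ...while the column layout reads only the one column it needs.
total_from_columns = sum(columns["amount"])

assert total_from_rows == total_from_columns == 237.5
```

For a table with dozens of columns and billions of rows, skipping the irrelevant columns is exactly why columnar storage pays off for analytical queries.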
MPP is a distributed design approach in which several processors apply a "divide and conquer" strategy to large data jobs. A large processing job is organized into smaller jobs which are then distributed among a cluster of processors (compute nodes). The processors complete their computations simultaneously rather than sequentially. The result is a large reduction in the amount of time Redshift needs to complete a single, massive job.
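The divide-and-conquer idea can be sketched in a few lines of Python. This toy example uses threads in place of Redshift's compute nodes, so it's purely illustrative of the pattern, not of Redshift's implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(values, workers=4):
    """Split one large job into smaller chunks, compute the chunks
    concurrently, then combine the partial results."""
    chunk = max(1, len(values) // workers)
    slices = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum, slices))  # each "node" sums its slice
    return sum(partials)  # the leader combines the partial results

print(parallel_sum(range(1_000_000)))  # same answer as sum(range(1_000_000))
```

In Redshift, a leader node plays the combining role, distributing query fragments to compute nodes and merging their results.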
No business or organization is exempt from data privacy and security regulations, and encryption is one of the pillars of data protection. This is especially true when it comes to complying with laws such as GDPR, HIPAA, the Sarbanes-Oxley Act, and the California Consumer Privacy Act (CCPA).
Encryption options in Redshift are robust and highly customizable, and this flexibility allows users to configure an encryption standard that best fits their needs.
For businesses that want additional security, administrators can isolate their network within Redshift. In this scenario, network access to an organization's cluster(s) is restricted by running the cluster inside an Amazon Virtual Private Cloud (VPC). The data warehouse remains connected to the existing IT infrastructure through an IPsec VPN.
Fault tolerance refers to the ability of a system to continue functioning even when some components fail. When it comes to data warehousing, fault tolerance determines whether a job can keep running when some processors or clusters are offline.
Data accessibility and data warehouse reliability are paramount for any user. AWS monitors its clusters around the clock. When drives, nodes, or clusters fail, Redshift automatically re-replicates data and shifts it to healthy nodes.
Concurrency limits determine the maximum number of nodes or clusters that a user can provision at any given time. These limits ensure that adequate compute resources are available to all users. In this sense, concurrency limits democratize the data warehouse.
Redshift maintains concurrency limits that are similar to other data warehouses, but with a degree of flexibility. For example, the number of nodes that are available per cluster is determined by the cluster's node type. Redshift also configures limits based on regions, rather than applying a single limit to all users. Finally, in some situations, users may submit a limit increase request.
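As a rough mental model, a concurrency limit behaves like a semaphore: work beyond the limit waits until a slot frees up. The sketch below is a generic Python illustration with a hypothetical limit of 2, not Redshift's actual admission logic:

```python
import threading
import time

MAX_CONCURRENT = 2  # hypothetical concurrency limit
slots = threading.BoundedSemaphore(MAX_CONCURRENT)
lock = threading.Lock()
active = 0  # jobs currently running
peak = 0    # highest number of jobs observed running at once

def run_job():
    global active, peak
    with slots:  # blocks while MAX_CONCURRENT jobs are already running
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)  # simulate a unit of work
        with lock:
            active -= 1

threads = [threading.Thread(target=run_job) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# All eight jobs complete, but never more than two at a time.
assert peak <= MAX_CONCURRENT
```

The point of the model: limits don't reject work outright; they queue it, so every user eventually gets compute time.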
Now that we've covered the distinguishing features of Amazon Redshift, let's talk about ETL. To get the most out of the platform, you need to carefully consider how data will move in and out of the repository. People throughout your organization will be loading and reading data, so you'll need a well-planned extract, transform, and load (ETL) process.
Many companies have relied on the conventional "lift-and-shift" approach to move their data from one platform to another. This is an in-house solution that requires time, expertise, and hand-coding. Here's a typical example of this approach to ETL:
In a traditional ETL process, data is transferred and processed in batches from source databases to the data warehouse. Building ETL workflows with batch processing typically involves a series of steps, from extracting and staging the source data to transforming it and loading the results into the warehouse.
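In miniature, a batch ETL run looks something like the following Python sketch. SQLite stands in for the Redshift destination, and the CSV source data is made up:

```python
import csv
import io
import sqlite3

# Hypothetical source: a CSV export from an operational database.
source_csv = "order_id,amount\n1,120.0\n2,75.5\n3,42.0\n"

# Extract: read the raw records out of the source.
records = list(csv.DictReader(io.StringIO(source_csv)))

# Transform: cast types and apply business rules before loading.
transformed = [(int(r["order_id"]), float(r["amount"])) for r in records]

# Load: write the batch into the warehouse (SQLite stands in for Redshift).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
warehouse.executemany("INSERT INTO orders VALUES (?, ?)", transformed)

total = warehouse.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # 237.5
```

In a real pipeline, each of these stages needs scheduling, error handling, and monitoring, which is where the engineering cost accumulates.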
Building your own data pipeline can be costly in terms of time and engineering bandwidth. But it's not the only option for moving data from your sources into Redshift. A more efficient solution can automate most of these steps to avoid the painstaking processes of hand-coding and manual extraction and loading.
ETL is usually thought of as a three-step process. Data is extracted, transformed, and then loaded into its destination. While this approach may be useful for some businesses, it's not always necessary or even preferred.
An alternative to traditional ETL allows users to move their data from source to destination quickly by allowing the destination data warehouse to manage the transformation process. It's a variation known as ELT, which puts the emphasis on extract and load.
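The contrast with ETL can be sketched the same way: raw data is landed first, and the warehouse itself performs the transformation in SQL. Again, SQLite stands in for the destination warehouse, and the data is hypothetical:

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")

# Extract + Load: land the raw records first, untransformed.
warehouse.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT)")
warehouse.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [("1", "120.0"), ("2", "75.5"), ("3", "42.0")],
)

# Transform: the warehouse casts and reshapes the data with SQL,
# after the load rather than before it.
warehouse.execute("""
    CREATE TABLE orders AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           CAST(amount AS REAL) AS amount
    FROM raw_orders
""")

total = warehouse.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # 237.5
```

Because an MPP warehouse like Redshift is built for exactly this kind of bulk processing, pushing the transformation to the destination can be both faster and simpler to maintain.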
You've done your homework and now it's time to replicate your data to Redshift. You can't wait on a long, drawn-out ETL process: you need a tool that delivers speed and reliability. That's where Stitch can help.
Stitch makes it easy to extract data from more than 90 sources and move it to Redshift in a flash. You'll also have the Stitch support team on your side to help you deliver fast, dependable builds for your team.
Our approach is simple, straightforward, and ready to go right out of the box. No need to wait — set up a free trial in minutes and get your data to Redshift today.