Amazon Redshift debuted in 2012 as the first cloud data warehouse, and remains the most popular one today. But Redshift wasn't developed in-house. Amazon invested $20 million in a company called ParAccel, and in return gained the license to use code from ParAccel Analytic Database (PADB) for Redshift. PADB was notable because it was a columnar database that ran on commodity hardware, which made it a natural choice as a basis for a cloud-based analytic database platform.
While that deal might have seemed worthwhile for ParAccel at the time, it worked out even better for Amazon in the long run. In 2015 Amazon CTO Werner Vogels called Redshift "the fastest-growing service in AWS, ever." Meanwhile, ParAccel was acquired by Actian in 2013, and PADB was renamed Actian Matrix.
PADB was itself based on PostgreSQL – so to some extent, Redshift is based on PostgreSQL – but "based on" leaves a lot of room for difference. If you're familiar with PostgreSQL features and syntax, how easy will it be to get used to Redshift?
First, there are architectural differences between Redshift and PostgreSQL:
- Under the hood, PostgreSQL is a traditional row-oriented relational database, great for processing transactional data. Redshift is a columnar database better suited for analytics, and thus a more appropriate platform for a data warehouse.
- In PostgreSQL a single database connection cannot utilize more than one CPU, while Redshift is architected for parallel processing across multiple nodes.
- When you load data into a Redshift table, Redshift distributes the rows of the table across nodes according to the table's distribution style. One of the distribution styles is key distribution, in which the rows are distributed according to the values in a specified column. Redshift's documentation says, "The leader node will attempt to place matching values on the same node slice. If you distribute a pair of tables on the joining keys, the leader node collocates the rows on the slices according to the values in the joining columns so that matching values from the common columns are physically stored together." PostgreSQL lacks distribution styles and distribution keys.
- Redshift doesn't support indexes. Instead, each table has a sort key, which determines how rows are ordered when the data is loaded. When you insert, update, or copy data in a Redshift table, new rows get added to an unsorted region, and are sorted only when the table is vacuumed or deep copied.
- Redshift doesn't enforce primary key, foreign key, or uniqueness constraints, though Amazon says "primary keys and foreign keys are used as planning hints and they should be declared if your ETL process or some other process in your application enforces their integrity."
The architectural changes Amazon made to Redshift make it better able to handle large volumes of data for analytical queries. PostgreSQL can serve as a data warehouse for smaller volumes of data, but it can't match the performance of Redshift's column-oriented architecture. But the good news is that if you're familiar with PostgreSQL commands and concepts, learning Redshift isn't like learning a foreign language. It's more like the difference between dialects – the American and British versions of English, for example.
Differences in SQL
Both databases use SQL as their native language. While a lot of the two platforms' SQL syntax is the same, there are plenty of differences as well. The list of Redshift SQL commands differs from the list of PostgreSQL commands, and even when both platforms implement the same command, their syntax is often different.
Per Amazon's documentation, here are some of the major differences between Redshift and PostgreSQL SQL commands:
Redshift doesn't support tablespaces, table partitioning, inheritance, and certain constraints. The Redshift implementation of CREATE TABLE enables you to define the sort and distribution algorithms for tables to optimize parallel processing.
Redshift doesn't support ALTER COLUMN actions, and ADD COLUMN supports adding only one column in each ALTER TABLE statement.
The Redshift COPY command is specialized to enable loading of data from Amazon S3 buckets and Amazon DynamoDB tables and to facilitate automatic compression.
INSERT, UPDATE, and DELETE
Redshift doesn't support the WITH clause.
The parameters for VACUUM are different between the two databases. For example, the default VACUUM operation in PostgreSQL reclaims space and makes it available for reuse. The default VACUUM operation in Redshift is VACUUM FULL, which reclaims disk space and resorts all rows.
Unsupported features and datatypes
Amazon lists many PostgreSQL features and functions that are not supported in Redshift. In addition to the aforementioned lack of support for constraints and indexes, the list of unsupported features includes stored procedures, triggers, value expressions, and inheritance.
Redshift lacks support for several PostgreSQL data types:
- BIT, BIT VARYING
- Composite types
- Date/time types INTERVAL and TIME
- Enumerated types
- Geometric types
- Network address types
- Numeric types SERIAL, BIGSERIAL, SMALLSERIAL, and MONEY
- Object identifier types
- Range types
- Text search types
Redshift does support these data types:
|SMALLINT||INT2||Signed two-byte integer|
|INTEGER||INT, INT4||Signed four-byte integer|
|BIGINT||INT8||Signed eight-byte integer|
|DECIMAL||NUMERIC||Exact numeric of selectable precision|
|REAL||FLOAT4||Single precision floating-point number|
|DOUBLE PRECISION||FLOAT8, FLOAT||Double precision floating-point number|
|BOOLEAN||BOOL||Logical Boolean (true/false)|
|CHAR||CHARACTER, NCHAR, BPCHAR||Fixed-length character string|
|VARCHAR||CHARACTER VARYING, NVARCHAR, TEXT||Variable-length character string with a user-defined limit|
|DATE||Calendar date (year, month, day)|
|TIMESTAMP||TIMESTAMP WITHOUT TIME ZONE||Date and time (without time zone)|
|TIMESTAMPTZ||TIMESTAMP WITH TIME ZONE||Date and time (with time zone)|
If you have SQL skills you developed from working with PostgreSQL, you'll be able to get by in Amazon Redshift pretty well – but you'll have to familiarize yourself with the differences between the two platforms. Redshift is optimized for analytical queries, and its commands and syntax reflect that. As long as you keep a command reference page bookmarked, you should be fine.