When we selected Redshift as the first data warehouse target for Stitch, we knew it offered great performance for batch processing and analytics workloads. We had experienced this first-hand using it as the data warehouse powering the RJMetrics platform. But we still had some questions about certain Redshift database best practices specific to Stitch.
There's a lot of information out there about Redshift — in Amazon’s documentation, on StackOverflow, and on a number of different personal and company blogs. These helped answer most of our questions, but not all of them — so we performed our own benchmarks to dig a little deeper. This is the first in a series of posts about our findings.
It’s already well established that the COPY command is the way to go for loading data into Redshift, but there are a number of different ways it can be used. Today we’ll look at the best data format — CSV, JSON, or Apache Avro — to use for copying data into Redshift.
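To make the comparison concrete, here's a minimal sketch of how the COPY statement varies across the three formats. The function and the table, bucket, and role names are hypothetical; the format clauses follow Redshift's documented CSV, JSON 'auto', and AVRO 'auto' options.

```python
def build_copy_statement(table, s3_path, iam_role, fmt):
    """Build a Redshift COPY statement for a given input format.

    Hypothetical helper: only the FORMAT clause changes between the
    CSV, JSON, and Avro variants of the load.
    """
    formats = {
        "csv": "CSV",
        "json": "JSON 'auto'",
        "avro": "AVRO 'auto'",
    }
    if fmt not in formats:
        raise ValueError(f"unsupported format: {fmt}")
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        f"FORMAT AS {formats[fmt]};"
    )

# Example: the JSON variant (names are placeholders)
print(build_copy_statement(
    "lineitem",
    "s3://example-bucket/lineitem.json",
    "arn:aws:iam::123456789012:role/example-role",
    "json",
))
```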
We decided to use the same dataset used by the TPC-H benchmark for this test. It comprises eight tables of data: nation, region, part, supplier, partsupp, customer, orders, and lineitem. Although these sound like data tables you’d only find in an ecommerce or retail business, the data is centered around users (customer), actions (orders), and properties (lineitem) — so it’s easy to see how analogues can be drawn to many other business models.
We used the dbgen utility provided by the TPC to generate test data in CSV format. Since this particular benchmark is very simple, we used only the lineitem data — the largest of the data tables. We wrote command-line scripts to convert the CSV file to JSON, and then the JSON file to Avro. This resulted in three files with 375,013,520 rows of the same data, but different file sizes:
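The CSV-to-JSON step of that conversion can be sketched roughly as follows. This is an illustration, not our actual script: the column names and delimiter are assumptions (dbgen emits delimited files with no header row), and the output is newline-delimited JSON, the layout Redshift's JSON COPY expects.

```python
import csv
import json

def csv_to_json_lines(csv_path, json_path, fieldnames, delimiter=","):
    """Convert a headerless delimited file to newline-delimited JSON.

    fieldnames supplies the column names, since dbgen output has no
    header row; each input row becomes one JSON object per line.
    """
    with open(csv_path, newline="") as src, open(json_path, "w") as dst:
        reader = csv.DictReader(src, fieldnames=fieldnames,
                                delimiter=delimiter)
        for row in reader:
            dst.write(json.dumps(row) + "\n")
```

The JSON-to-Avro step works the same way in spirit, but needs an Avro library and an explicit schema, so it's omitted here.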
Our hypothesis was that COPY performance would correlate to the raw file size on disk, since none of these formats is particularly complex to interpret, and the files are large enough that it takes a long time just to move them across the network.
I uploaded each of these files to Amazon S3, and then loaded from S3 into Redshift using a script that creates a brand new single-node dc1.large cluster, creates a table with the appropriate schema, and runs the COPY. This is the average time of three trials for each of the three file formats:
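The measurement side of that script boils down to timing the same operation several times and averaging. A minimal sketch of that loop, where `load_fn` is a hypothetical stand-in for the real work (spinning up the cluster and running the COPY):

```python
import time

def average_runtime(load_fn, trials=3):
    """Run load_fn `trials` times and return mean wall-clock seconds.

    load_fn stands in for the actual benchmark step; averaging over
    several trials smooths out run-to-run noise.
    """
    elapsed = []
    for _ in range(trials):
        start = time.perf_counter()
        load_fn()
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / trials
```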
Contrary to my expectations, the Avro file — the smallest of the three — took the longest time to load. It was so slow that I reached out to Amazon’s Redshift team to make sure I wasn’t doing something wrong. The answer: this is a known issue that they’re working on. Until it’s fixed, I strongly suggest using only CSV-formatted data when COPYing into Redshift.
In part two, I’ll look at how compression of input files affects COPY performance.