In part one of this series we found that CSV is the most performant input format for loading data with Redshift’s COPY command. In this post we’re once again looking at COPY performance, this time using three different input file compression algorithms: bzip2, gzip, and LZO.
What did we find? If you’re looking for a good default option for compressing COPY files to Redshift, the data indicates you should go with the LZO algorithm. Read on for more information about how we conducted the experiments, the details of our findings, and further explanation as to why the data came out the way it did.
Methodology and measurement
Redshift supports compression on input files of any format. We used CSV-formatted files since we already knew that format is the fastest. We once again used the dbgen utility provided by the TPC to generate test data composed of eight tables:
Although these sound like data tables you’d only find in an ecommerce or retail business, the data is centered around users (customer), actions (orders), and properties (lineitems), so it’s easy to see how analogues can be drawn to other business models. One input to the data generator is a scale factor that determines the size of the resulting data set. We used three different scale factors — 1, 10, and 100 — to see how compression impacts different-sized inputs. The table below shows the aggregate length and size of the resulting datasets:
As you can see, the scale factor corresponds to roughly the number of gigabytes in the resulting data set.
Compression at different data sizes
The chart below shows the inverse compression ratio that each algorithm achieves at each scale. Put another way, the percentages are the size of the compressed files as a percentage of the original file size, so lower is better. (Note that we are using LZO at compression level 3.)
We expected the compression ratio for a single algorithm to be consistent across all scales to within a small degree of variability, and this is what the data shows. So what does this mean for COPY speed?
Copy performance hypothesis
Compression creates a tradeoff between disk space, IO, and CPU consumption. Although it’s not a universal rule, it’s often true that algorithms producing higher compression ratios — i.e. smaller compressed files — demand more CPU consumption. Compression algorithms can also be tailored for certain types of input data. This means that picking the right compression algorithm depends on the type of data you’re compressing and the amount of each of those resources you have to spare.
Let’s use this framework to look at a Redshift COPY. We’re primarily concerned with speed, and the input data lives on Amazon S3, which is both cheap and plentiful, so disk space is not a major concern.
Even though we don’t know exactly how Redshift works internally, we know a COPY must use disk IO to read the input data off of S3, and network IO to transfer it from S3 to the Redshift cluster. We couldn’t find documentation about network transfer performance between S3 and Redshift, but AWS supports up to 10Gbit/s on EC2 instances, and this is probably what Redshift clusters support as well. This is a fast connection — even operating at 80% of this speed, we would expect to spend only one second on network transfer per gigabyte of input data.
Finally there’s CPU, which again we have limited information about since Redshift is a closed system. However, we know that Redshift performs its own compression before storing data internally, which requires CPU, so we can reasonably assume that the CPU is going to be busy doing more than just decompressing our input files.
Given all of the above, our hypothesis was that for smaller data sets, the benefit of compression would be outweighed by the processing cost of decompression, but for larger data sets compression would yield a performance improvement. Further, since CPU is likely our most precious resource, we expected to find that the least CPU-intensive compression algorithms — and possibly even no compression — would perform COPYs the fastest.
Copy performance results
I generated, compressed, and uploaded test data to S3 using this script, and then loaded the data from S3 into Redshift using this script that creates a brand new single-node dc1.large cluster, creates tables with the appropriate schemas, and COPYs each table. This is the combined time to load all tables, separated by scale factor and compression algorithm, normalized by the uncompressed load time at each scale factor:
The data bears out some of our hypotheses: bzip2 — the algorithm with the highest compression ratio — is indeed the slowest COPY performer, and the benefit of compression is more substantial at scale 100 than at scale 1. The clearest takeaway is that LZO compression is the most performant at every scale factor we tested. The LZO — or Lempel-Ziv-Oberhumer — algorithm was designed specifically to optimize for fast decompression, so this result is not too surprising, and helps to validate the idea that we are primarily CPU-bound when COPYing compressed data.
To see the performance at even smaller data sizes, we looked at a single table within each dataset — the customers table — which comprises 150,000 rows at scale 1, 1.5M rows at scale 10, and 15M rows at scale 100. The COPY performance of this table alone was:
Here we see that compression actually slows down COPY time at the smallest scale 1 input, which weighed in at 23MB uncompressed. However, at scale 10 (file size 230MB), both LZO and gzip compression provide a speed boost.
Key takeaway: default to LZO
At small file sizes — less than 50MB or so — the speed benefit of compression is typically going to be negligible, and may even hurt performance. At larger file sizes, LZO is the best default option. The data shows that gzip gains parity with LZO for large — 50GB+ — files, and the trend suggests that it might even prove to be more performant for much larger files.
Of course, COPY speed isn’t the only reason to use compression — it will also save you money on your S3 bill. On the other hand, we didn’t account for the additional computing power required to compress the files in the first place. All of these factors should weigh in to the selection of the optimal compression strategy. But, if you need a good default option, choose LZO.