Boston-based Lovepop designs, customizes, and sells beautiful 3D pop-up greeting cards using Shopify as their ecommerce platform. Shopify tracks a great deal of data for each transaction, and Lovepop uses Periscope Data to analyze that data to derive business intelligence. The faster their analytics queries run, the easier it is to get insights.
The problem was that those queries weren’t running fast enough. To speed them up, Lovepop started using the Periscope Cache, which stores data in Amazon Redshift for faster retrieval and improved query times. But while implementing caching was easy, populating the cache with timely data proved troublesome.
Mike Prentice, principal data engineer at Lovepop explains the problem. “We have slow-to-compute views that we cache in Periscope Data to optimize query time and enable rapid filtering of the data. We often want to join them with raw Shopify data, so we needed to find efficient ways of getting that raw data into the cache.” Unfortunately, not all of the Shopify tables have an ID or date column. That meant that every eight hours, when Lovepop wanted to update the cache, they had to empty and refresh the cached Shopify data, even though only a fraction of the data was new. That was inefficient and time-consuming.
Prentice needed to find a smarter approach. After reading the documentation on how the caching algorithms in Periscope Data work, and banging his head on the table a few times, he realized that “we do have a date/time column we can key off of in the Stitch metadata.”
Lovepop had already been using Stitch to extract data from Shopify and load it into Lovepop’s Redshift data warehouse. The company had chosen Stitch after reviewing other ETL tools because “it was the easiest to set up” and integrated with many of their existing and planned data sources.
Prentice realized that the Stitch data columns
_sdc_received_at, which are appended to each record in the pipeline, make a great proxy for the date/time keys that were absent in some of the tables’ native data. “We could use them in our selection criteria and get only the items that have changed since the last caching run,” Prentice says. That lets Lovepop take advantage of Periscope Data’s Bounded Updates caching strategy:
Every time it’s scheduled, the Bounded Updates strategy replaces the most recent N days of rows. The Periscope Cache rows are replaced by rows in the customer database that have an import column value after the start of the bound.
Lovepop now runs bounded cache updates every eight hours with a window of one day, selecting records based on the _sdc_received_at column (see figure).
Having fresh data available in the cache has made Lovepop’s analysts happier. Before caching, user queries frequently timed out after several minutes; now they’re done in seconds and can be filtered rapidly and aggregated as needed.
Got a tricky ETL issue of your own? Sign up with Stitch and see how we can help.