Delivering the right latency for analytics

Why doesn't Stitch replicate your data in real time? Because it's counterproductive for most Stitch users, for both technical and cost reasons.

Real-time inserts are a problem

Stitch replicates data to data warehouses such as Amazon Redshift, Google BigQuery, Snowflake, and Microsoft Azure SQL Data Warehouse, among other destinations. Organizations build their data analytics stacks on these warehouses because columnar storage efficiently groups together the data that analysts are likely to retrieve together. But the same architecture that makes columnar stores fast at retrieving data makes them poorly suited to the row-oriented updates and inserts typical of online transaction processing (OLTP) databases.
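
To make that tradeoff concrete, here's a minimal Python sketch of the two layouts. It's illustrative only, not how Redshift or any real warehouse is implemented internally: scanning one column of a columnar store is a single contiguous read, but inserting a single row means touching every column.

    # Illustrative only: the same table in row-oriented and column-oriented
    # form. Real warehouses are far more sophisticated than plain lists.
    rows = [
        {"order_id": 1, "amount": 19.99, "region": "EU"},
        {"order_id": 2, "amount": 5.49, "region": "US"},
        {"order_id": 3, "amount": 42.00, "region": "US"},
    ]

    # Row-oriented (OLTP-friendly): appending a record touches one place.
    row_store = list(rows)
    row_store.append({"order_id": 4, "amount": 7.25, "region": "EU"})

    # Column-oriented (analytics-friendly): each column is stored together,
    # so an aggregate like SUM(amount) reads one contiguous sequence...
    col_store = {
        "order_id": [r["order_id"] for r in rows],
        "amount": [r["amount"] for r in rows],
        "region": [r["region"] for r in rows],
    }
    total_revenue = sum(col_store["amount"])

    # ...but inserting a single row requires a separate write to every
    # column, which is why frequent small inserts are expensive here.
    for column, value in {"order_id": 4, "amount": 7.25, "region": "EU"}.items():
        col_store[column].append(value)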

The most efficient way to load data into Redshift is with the COPY command, which "allocates the workload to the cluster nodes and performs the load operations in parallel, including sorting the rows and distributing data across node slices," according to Redshift documentation. "You can also add data to your tables using INSERT commands, though it is much less efficient than using COPY." In fact, inserting a single row may take as long as appending tens, hundreds, or even thousands of records through a bulk COPY.
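
As a rough illustration of what that difference looks like from a loading script, here's a hypothetical Python sketch using psycopg2. The cluster address, table, S3 path, and IAM role are all placeholder values, not anything Stitch-specific:

    import psycopg2

    # Placeholder connection details for an example Redshift cluster.
    conn = psycopg2.connect(
        host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="analytics",
        user="loader",
        password="REPLACE_ME",
    )
    cur = conn.cursor()

    # Slow path: one INSERT per record. Each statement is a row-oriented
    # write, exactly the access pattern a columnar store handles poorly.
    for order_id, amount in [(1, 19.99), (2, 5.49), (3, 42.00)]:
        cur.execute(
            "INSERT INTO orders (order_id, amount) VALUES (%s, %s)",
            (order_id, amount),
        )

    # Fast path: stage the same records as a file in S3, then issue a
    # single COPY so Redshift parallelizes the load across node slices.
    cur.execute("""
        COPY orders
        FROM 's3://example-bucket/batches/orders.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/example-load-role'
        CSV;
    """)
    conn.commit()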

Real-time replication degrades the performance of data warehouses, bogging down data loading and taking up processing resources that could be spent creating reports.

Thanks to the elastic nature of the cloud, you could spin up more nodes to handle the increased load, but there's no free lunch: you'll pay for every additional node it takes to maintain the level of performance you're comfortable with.

The right rate of replication

As we've worked with the thousands of businesses that use Stitch every day, we've come to understand their priorities. We've found that 15 to 20 minutes of latency is the right tradeoff: fast enough to get maximum value from analytics, yet infrequent enough to protect data warehouse performance, so that's what Stitch is tuned for. By loading on that cadence, we avoid degrading customers' data warehouse performance and incurring unnecessary costs on their behalf.
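
For intuition, micro-batching on a fixed interval looks something like the sketch below. This is an illustrative simplification, not Stitch's actual implementation; the interval, the buffer, and the load_batch function are all hypothetical:

    import time

    BATCH_INTERVAL_SECONDS = 15 * 60  # e.g. a 15-minute micro-batch window

    def load_batch(records):
        """Placeholder for a bulk load, such as staging to S3 plus one COPY."""
        print(f"loading {len(records)} records in one bulk operation")

    def run_micro_batches(change_stream):
        """Buffer incoming changes and flush one bulk load per interval,
        so the warehouse sees one COPY per window instead of one INSERT
        per change."""
        buffer = []
        deadline = time.monotonic() + BATCH_INTERVAL_SECONDS
        for record in change_stream:
            buffer.append(record)
            if time.monotonic() >= deadline:
                load_batch(buffer)
                buffer = []
                deadline = time.monotonic() + BATCH_INTERVAL_SECONDS
        if buffer:  # flush whatever remains when the stream ends
            load_batch(buffer)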

Data analytics isn't a race in which the winner is the first to finish replicating data. It's tempting to think that the most up-to-date data will give you the best BI, but that isn't necessarily true. Think about how your company uses data analytics. If it's to give managers the information they need to make better decisions, it doesn't make sense to update your BI dashboards faster than the people reading them can absorb the changes.

For most organizations and for most use cases, data that's a few minutes old is sufficiently up to date. How rapidly are your people and your systems actually making decisions? For most use cases the answer is several minutes at the fastest, and often hours or days.

That said, some use cases do benefit from near-real-time loading of replicated data. If you run a chatbot to automate answers to customer inquiries, it needs to know the context of the customer's most recent interaction, so an appropriate level of latency might be a few seconds. If you're analyzing data from a heart monitoring implant, you might want no more than a second's latency. If you're doing algorithmic trading in the financial markets, you'll want up-to-the-microsecond pricing information. Stitch isn't the optimal tool for these use cases.

Whether you need real-time latency depends on what you need to do with your data. Whether it's cost-effective depends on what you pay to support your use case. Stitch provides an efficient, cost-effective balance between latency and load. Take advantage of a free trial of Stitch and see for yourself.

Image credit: JamesPaulT