Data analytics architecture: Integrated stack or best-of-breed components?

In this guest post, Nupanch CEO Akash Agrawal compares the pros and cons of two kinds of data stacks: fully integrated architectures and best-of-breed components.

Data architectures have evolved over the past couple of decades. Many businesses have moved from row-oriented relational databases to columnar databases such as Amazon Redshift, Google BigQuery, and Snowflake, or to object stores like Amazon S3. On the business intelligence side of things, Excel spreadsheets have been replaced by sophisticated BI tools such as Microsoft Power BI, Tableau, Looker, Chartio, Periscope Data, and Mode, which make agile data analytics accessible not just to data scientists but also to business users. The availability of cloud platforms for data warehouses, ETL, and BI has driven many of these changes.

There are two categories of data architectures for cloud-based tools:

  1. Fully integrated: The complete data stack – ingestion of data, modeling, and analysis – is organized and implemented together, typically by a third-party BI platform like Magento Business Intelligence (formerly RJMetrics), Birst, GoodData, or Domo. The third-party platform replicates data from the organization's transactional database and hosts it in a columnar data warehouse to which the organization typically does not have direct access.
  2. Best-of-breed components: In this architecture, the data ingestion layer (and sometimes even the modeling layer) is kept separate from the analysis layer. Tools like Stitch are used to ingest data into data warehouses like Amazon Redshift. The BI platform provides visualization and analysis tools for the data in the warehouse. A notable difference in this architecture is that the organization has direct query access to its data.

These architectures have different tradeoffs, notably in areas such as ease of implementation, vendor lock-in, control and ownership, data science capabilities, and historical data storage. Let's look at each of these.

Ease of implementation

A fully integrated solution is the best option for organizations that are looking to quickly set up self-service analytics without an in-house data team. The entire data stack is managed by the BI platform, so there's no need to implement and maintain separate pieces of the stack.

A best-of-breed components architecture is a bit more complex to implement. It usually takes more time to choose and connect the different pieces of the stack. Data is consolidated in a cloud data warehouse or object store, which is the organization's responsibility to manage. A modeling/analysis tool connects to that data warehouse to enable users to generate insights. Organizations opting for this model typically have a small team that's capable of implementing and maintaining all the components of their data stack.

In some cases, organizations may also want to keep the modeling layer within the data warehouse, outside of the BI platform. This requires significantly more effort to implement, and usually includes the use of workflow/scheduling tools to create and test transformations.
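As a rough illustration of what keeping the modeling layer in the warehouse can look like, here is a minimal Python sketch of one transformation plus the checks a scheduler might run after it. It uses the standard library's sqlite3 as a stand-in for a cloud data warehouse, and the table names (raw_orders, daily_revenue) and function names are hypothetical, not taken from any particular tool.

```python
import sqlite3


def build_daily_revenue(conn):
    """Rebuild the modeled table from raw orders. Safe to rerun."""
    conn.executescript("""
        DROP TABLE IF EXISTS daily_revenue;
        CREATE TABLE daily_revenue AS
        SELECT order_date,
               SUM(amount) AS revenue,
               COUNT(*)    AS orders
        FROM raw_orders
        WHERE status = 'completed'
        GROUP BY order_date;
    """)


def check_daily_revenue(conn):
    """Data tests: one row per day and no missing revenue values."""
    duplicate_days = conn.execute(
        "SELECT COUNT(*) FROM ("
        "  SELECT order_date FROM daily_revenue"
        "  GROUP BY order_date HAVING COUNT(*) > 1)"
    ).fetchone()[0]
    null_revenue = conn.execute(
        "SELECT COUNT(*) FROM daily_revenue WHERE revenue IS NULL"
    ).fetchone()[0]
    return duplicate_days == 0 and null_revenue == 0


def demo():
    # In-memory database standing in for the warehouse (assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE raw_orders "
        "(order_id INTEGER, order_date TEXT, amount REAL, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_orders VALUES (?, ?, ?, ?)",
        [
            (1, "2019-01-01", 20.0, "completed"),
            (2, "2019-01-01", 30.0, "completed"),
            (3, "2019-01-02", 15.0, "cancelled"),
            (4, "2019-01-02", 40.0, "completed"),
        ],
    )
    build_daily_revenue(conn)
    if not check_daily_revenue(conn):
        raise RuntimeError("daily_revenue failed its data tests")
    return conn.execute(
        "SELECT order_date, revenue, orders FROM daily_revenue "
        "ORDER BY order_date"
    ).fetchall()


if __name__ == "__main__":
    print(demo())
```

In practice a workflow or scheduling tool would call the build step and then the check step on a timer, and alert someone when the check fails.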

Vendor lock-in

Sometimes a business outgrows a component of its data stack, and needs to migrate to a more powerful one. The most common case is when the BI tool is no longer adequate, but the same thing can happen with the data warehouse.

A fully integrated architecture is the most difficult to migrate because nothing the organization has built can be reused. Modeling has to be set up from scratch, and analyses have to be rebuilt. The cost of doing so is high, which can lead organizations to delay a necessary migration.

A best-of-breed components architecture is easier to migrate because the individual pieces are independent. If the BI tool is being migrated, the data ingestion piece remains as is, and only the BI piece needs rebuilding. Migrating data ingestion is as simple as configuring a new tool that loads data to the same data warehouse.

Control and ownership

As organizations grow, they need more and more control over their data stacks, and more freedom to adjust configurations – the frequency of refreshing data on reports, definition changes to key metrics, and integration and embedding with other tools.

A fully integrated architecture offers practically zero control over access to data. Businesses usually cannot access the data in the data warehouse outside of the platform. Data teams can still make reporting-level metric definition changes, but they have limited control over how often data is brought from the data sources to the data warehouse, and subsequently to reports.

A best-of-breed components architecture allows for full control of the data and the process of loading it into a data warehouse. The organization is responsible for any transformations to the data after it's loaded into the data warehouse, whether they are performed by a third-party tool or by the BI platform during report preparation. This architecture makes it easier to make changes to both the frequency of data replication and business logic embodied within reports.

Data science capabilities

A few years ago, data engineers, who move around and model organizations' data, and data scientists, who use that data for prescriptive and predictive analytics, were on different teams. Today, data science increasingly occurs closer to the data warehouse. Data science scripts and tools can be directly connected to highly performant data warehouses to take advantage of an already transformed data structure that's more amenable to analytics.

A fully integrated architecture limits the amount of data science that an organization can set up because that depends on the data science capabilities of the BI platform. The organization has no direct access to the data warehouse.

In comparison, a best-of-breed components architecture offers organizations direct connectivity to their data warehouses, which gives them the option, for example, to use complex R and Python scripts for data science and machine learning.
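To make the idea of direct connectivity concrete, here is a small Python sketch, assuming a modeled table of weekly revenue already exists in the warehouse. The script queries the table directly and fits a straight-line trend to forecast the next week. sqlite3 stands in for the warehouse connection, and the weekly_revenue table and function names are hypothetical. A real project would more likely use a dedicated statistics or machine learning library.

```python
import sqlite3


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast_next_week(conn):
    """Query the modeled table directly and extrapolate one week ahead."""
    rows = conn.execute(
        "SELECT week_number, revenue FROM weekly_revenue "
        "ORDER BY week_number"
    ).fetchall()
    slope, intercept = fit_line(rows)
    next_week = rows[-1][0] + 1
    return slope * next_week + intercept


def demo():
    # In-memory database standing in for the warehouse (assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE weekly_revenue (week_number INTEGER, revenue REAL)"
    )
    conn.executemany(
        "INSERT INTO weekly_revenue VALUES (?, ?)",
        [(1, 100.0), (2, 120.0), (3, 140.0), (4, 160.0)],
    )
    return forecast_next_week(conn)


if __name__ == "__main__":
    print(demo())
```

The point is less the model than the access pattern: the script reads already-transformed data with a plain SQL query, which is not possible when the warehouse is hidden behind a BI platform.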

Historical data storage

Often, ecommerce companies write scripts to store historical data when transactional systems do not do so. For example, they may store inventory snapshots for products when their transactional systems only maintain the current status of inventory. To access inventory numbers from the past, the business may create a script to push its inventory data into a new table in its data warehouse on a daily or weekly basis.
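A snapshot script of this kind might look roughly like the following Python sketch. It assumes a transactional inventory table that holds only current quantities and a warehouse table keyed by date and product. Both databases are simulated with sqlite3, and the table and function names are hypothetical.

```python
import sqlite3
from datetime import date


def snapshot_inventory(source, warehouse, snapshot_date):
    """Copy current stock levels into a dated history table."""
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS inventory_snapshots ("
        "  snapshot_date TEXT,"
        "  product_id INTEGER,"
        "  quantity INTEGER,"
        "  PRIMARY KEY (snapshot_date, product_id))"
    )
    rows = source.execute(
        "SELECT product_id, quantity FROM inventory"
    ).fetchall()
    # INSERT OR REPLACE keeps a rerun on the same day from duplicating rows.
    warehouse.executemany(
        "INSERT OR REPLACE INTO inventory_snapshots VALUES (?, ?, ?)",
        [(snapshot_date.isoformat(), pid, qty) for pid, qty in rows],
    )
    warehouse.commit()
    return len(rows)


def demo():
    # Two in-memory databases stand in for the transactional system
    # and the data warehouse (assumption).
    source = sqlite3.connect(":memory:")
    warehouse = sqlite3.connect(":memory:")
    source.execute(
        "CREATE TABLE inventory "
        "(product_id INTEGER PRIMARY KEY, quantity INTEGER)"
    )
    source.executemany(
        "INSERT INTO inventory VALUES (?, ?)", [(1, 10), (2, 5)]
    )
    snapshot_inventory(source, warehouse, date(2019, 1, 1))
    # The source system overwrites the old quantity; history is lost there.
    source.execute("UPDATE inventory SET quantity = 7 WHERE product_id = 1")
    snapshot_inventory(source, warehouse, date(2019, 1, 2))
    snapshot_inventory(source, warehouse, date(2019, 1, 2))  # rerun
    return warehouse.execute(
        "SELECT snapshot_date, product_id, quantity "
        "FROM inventory_snapshots ORDER BY snapshot_date, product_id"
    ).fetchall()


if __name__ == "__main__":
    for row in demo():
        print(row)
```

Keying the history table on date and product makes the job safe to rerun, which matters when a scheduled run fails partway and has to be repeated.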

While this is technically possible in both architectures, businesses risk losing that data if they migrate from a fully integrated architecture to a new BI platform, because they lack access to the data that's maintained within the data warehouse managed by the BI platform. To avoid losing the data, they would need to manually export it from the old BI platform before migration and find a way to continue storing historical data in the new BI platform without a gap.

With direct access and control over data ingestion in the best-of-breed components architecture, organizations can more easily maintain ownership of the historical data that they're pushing. Stitch's Import API is perfect for this use case – it allows an organization to push data to its data warehouse using custom scripts. In the example of pushing historical inventory, the script that pushes to the Import API can be configured to run on a daily basis, storing the available inventory of each product on each day. This is also a great use case for Singer, the open source ETL project that enables movement of data from a source "tap" to a destination "target."
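For the Singer route, a tap is a program that writes JSON messages to standard output, one per line, using the SCHEMA, RECORD, and STATE message types. The Python sketch below shows what a tiny inventory-snapshot tap could emit. The stream name, field names, sample quantities, and the shape of the state value are illustrative assumptions, not part of Singer or Stitch.

```python
import json
import sys
from datetime import date

STREAM = "inventory_snapshots"  # hypothetical stream name


def singer_messages(inventory, snapshot_date):
    """Yield Singer-style messages for one day's inventory snapshot."""
    day = snapshot_date.isoformat()
    # SCHEMA describes the records that follow and names the key fields.
    yield {
        "type": "SCHEMA",
        "stream": STREAM,
        "key_properties": ["snapshot_date", "product_id"],
        "schema": {
            "type": "object",
            "properties": {
                "snapshot_date": {"type": "string", "format": "date"},
                "product_id": {"type": "integer"},
                "quantity": {"type": "integer"},
            },
        },
    }
    # One RECORD per product, in a stable order.
    for product_id, quantity in sorted(inventory.items()):
        yield {
            "type": "RECORD",
            "stream": STREAM,
            "record": {
                "snapshot_date": day,
                "product_id": product_id,
                "quantity": quantity,
            },
        }
    # STATE is a bookmark; its contents are up to the tap (assumed here).
    yield {"type": "STATE", "value": {"last_snapshot_date": day}}


def write_messages(messages, out=sys.stdout):
    """Write one JSON message per line, as a target expects on stdin."""
    for message in messages:
        out.write(json.dumps(message) + "\n")


if __name__ == "__main__":
    # Hard-coded quantities stand in for a query against the source system.
    write_messages(singer_messages({101: 12, 102: 0}, date.today()))
```

Saved under a hypothetical name such as tap_inventory.py, the script's output can be piped into any Singer target, which handles loading the records into the destination.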

In summary

For businesses without an in-house data team, a fully integrated architecture is a great way to set up self-serve analytics. A platform of this kind needs little maintenance – the only piece to maintain is whatever has been set up within the BI platform. Users can obtain insights easily.

Around the time when skill-specific teams start to form in the organization, businesses often feel the need to start owning some of the data stack so that they have more control over it. This is when they should make a transition to a best-of-breed components architecture, perhaps with the assistance of a team of data specialists to implement and maintain the stack.

Some businesses may also want to separate the modeling layer from the analysis/reporting layer. This often happens around the time the organization wants to use advanced data science tools with its data, and build custom historical data storage.

In the medium to long term, I see organizations choosing a fully integrated architecture less often, and relying more on best-of-breed components architectures, thanks to the increasing availability of easy-to-use tools to set up and manage all components of the data stack. Organizations today are more data-savvy than ever, and executives and business users are becoming capable of implementing data stacks on their own.

Image credit: Anna