Builders erect houses from blueprints, with architecture constrained by physical limitations. When someone moves in, they decide how the inside of their home should look, and their preferences determine interior decor and design. Data warehouses are similar: their architecture is standardized, but their design is adapted to business and user needs.
Data warehouse architecture is inherent to the main hosting platform or service selected by an organization. It's a built-in, static infrastructure, and hosts all the specific tools and processes that eventually make up the warehouse.
Data warehouse design, however, should be customized with an understanding of an organization's stakeholders and strategic requirements. Short-sighted design choices limit the potential and performance of enterprise data warehouses.
Before discussing data warehouse design, it's important to understand the foundational role that architecture plays in both on-premises and cloud data warehouses.
Data warehouses contain historical, current, and critical enterprise data. They form the storage and processing platform underlying reporting, dashboards, business intelligence, and analytics.
Originally, data warehouses ran on on-premises hardware, and were architected in three distinct tiers:
A major issue associated with on-premises data warehouses is the cost of deployment and management. Businesses must purchase server hardware, set aside space to house it, and devote IT staff to set it up and administer it.
More important, however, is the fact that on-premises hardware is more difficult to scale than cloud-based storage and computing. Managers and decision-makers must get approval for new hardware, assign budgets, and wait for shipment, and then engineers and IT specialists must installation and set up both hardware and software. After installation, operational systems may not be used to capacity, which means money wasted on resources that aren't contributing value.
This type of mission-critical infrastructure requires a high level of expenditure, attention, and employee specialization when deployed on-premises. Modern cloud services don't have these problems, because resources are scalable and pricing is too.
Cloud data warehouses are more adaptable, performant, and powerful than in-house systems. Businesses can save on staffing and can put their IT staff to better use, because their infrastructure is managed by dedicated specialists.
Cloud data warehouses feature column-oriented databases, where the unit of storage is a single attribute, with values from all records. Columnar storage does not change how customers organize or represent data, but allows for faster access and processing.
Cloud data warehouses also offer automatic, near-real-time scalability and greater system reliability and uptime than on-premises hardware, and transparent billing, which allows enterprises to pay only for what they use.
Because cloud data warehouses don't rely on the rigid structures and data modeling concepts inherent in traditional systems, they have diverse architectures.
Data warehouses are relational databases, and they are associated with traditional schemas, which are the ways in which records are described and organized.
Snowflake schema: SqlPac
Star schema: SqlPac
A complex system like a modern data warehouse includes many dependencies. Errors may propagate throughout the system and make mistakes difficult to rectify.
After selecting a data warehouse, an organization can focus on specific design considerations. Here are seven steps that help ensure a robust data warehouse design:
1. User needs: A good data warehouse design should be based on business and user needs. Therefore, the first step in the design procedure is to gather requirements, to ensure that the data warehouse will be integrated with existing business processes and be compatible with long-term strategy. Enterprises must determine the purpose of their data warehouse, any technical requirements, which stakeholders will benefit from the system, and which questions will be answered with improved reporting, business intelligence (BI), and analytics.
2. Physical environment: Enterprises that opt for on-premises architecture must set up the physical environment, including all the servers necessary to power ETL processes, storage, and analytic operations. Enterprises can skip this step if they choose a cloud data warehouse.
3. Data modeling: Next comes data modeling, which is perhaps the most important planning step. The data modeling process should result in detailed, reusable documentation of a data warehouse's implementation. Modelers assess the structure of data in sources, decide how to represent these sources in the data warehouse, and specify OLAP requirements, including level of processing granularity, important aggregations and measures, and high-level dimensions or context.
4. ETL/ELT: The next step is the selection of an ETL/ELT solution. ETL transforms data prior to the loading stage. When businesses used costly in-house analytics systems, it made sense to do as much prep work as possible, including transformations, prior to loading data into the warehouse. However, ELT is a better approach when the destination is a cloud data warehouse. Organizations can transform their raw data at any time, when and as necessary for their use case, and not as a step in the data pipeline.
5. Semantic layer: Next up is designing the data warehouse's semantic layer. Based on previously documented data models, the OLAP server is implemented to support the analytical queries of individual users and to empower BI systems. This step determines the core analytical processing capabilities of the data warehouse, so data engineers should carefully consider time-to-analysis and latency requirements.
6. Reporting layer: With analytic infrastructure built and implemented, an organization can design a reporting layer. An administrator designates groups and end users, describes and enforces permissible access, and implements reporting interfaces or delivery methods.
7. Test and tune: All that remains is to test and tune the completed data warehouse and data pipeline. Businesses should assess data ingestion and ETL/ELT systems, tweak query engine configurations for performance, and validate final reports. This is a continuous process requiring dedicated testing environments and ongoing engagement.
If you choose to work with a cloud data warehouse, you need a way to populate it with the data in your existing databases and SaaS tools. That's where Stitch comes in.
Stitch is a cloud-based ETL tool that pulls your data from more than 100 different sources to key data destinations like Redshift, BigQuery, Snowflake, and Azure SQL Data Warehouse. Set up a free trial now and get data into your warehouse quickly.