Google BigQuery’s cloud-based data warehouse and analytics platform uses a built-in query engine and a highly scalable serverless computing model to process terabytes of data in seconds and petabytes in minutes.
BigQuery is a fast, powerful, and flexible data warehouse that’s tightly integrated with the other services on Google Cloud Platform. It’s cost-efficient, offers use-based pricing, and uses a serverless model.
BigQuery’s columnar database
BigQuery’s column-based storage service is behind the data warehouse’s speed and its ability to handle enormous quantities of data. Data in most relational databases is stored and accessed by row, and that’s an efficient storage scheme for transactional databases. For analytical databases, however, it’s more efficient to store data by column.
To understand why, imagine a database that has 100 columns, only eight of which need to be used for a particular query. If the data is stored in row-based storage, the DBMS has to read each row, and therefore all 100 columns, to extract data from the eight columns of interest. By storing data in columns, a columnar database can process only the eight columns of interest, which makes for faster answers and more efficient use of resources.
The Google ecosystem
BigQuery is part of Google Cloud Platform, and integrates with other GCP services and tools. BigQuery can process data stored in other GCP products, including Cloud Storage, the Cloud SQL relational database service, the Cloud Bigtable NoSQL database, Google Drive, and Spanner, Google’s distributed database.
BigQuery supports standard SQL access, and can also integrate with business intelligence tools such as Tableau and Looker.
Google BigQuery: designed for performance
BigQuery evolved from Dremel, Google’s distributed query engine. Dremel can handle terabytes of data in seconds by taking advantage of distributed computing within a serverless architecture — processing complex queries using multiple servers in parallel to dramatically increase processing speed.
Dremel and BigQuery can scale to thousands of machines by structuring computations as an execution tree. A root server receives an incoming query and routes it to branches, called mixers, which modify incoming queries and deliver them to leaf nodes, called slots. Working in parallel, the leaf nodes do the grunt work of reading and filtering data. The results move back down the tree; the mixers aggregate the results and send them to the root as the answer to the query.
In most data warehouse environments, organizations need to specify and commit to the server hardware on which computations will run. Administrators have to provision for performance, security, elasticity, and reliability. A serverless model circumvents this constraint.
In a serverless model, processing is automatically distributed over a large number of machines working in parallel. Using BigQuery’s serverless model, data engineers and database administrators focus less on infrastructure and more on provisioning servers and gaining insights from data.
SQL and programming language support
Users can access BigQuery via standard SQL, which many users are familiar with. BigQuery also has client libraries for writing applications that access data in Java, Python, C#, Go, Node.js, PHP, and Ruby.
BigQuery can process and run reports on real-time data by leveraging other GCP services and resources. Data warehouses can support analytics after data from multiple sources is consolidated and stored — which often happens in batches throughout the day. In addition to batch processing, BigQuery supports streaming at a rate of millions of rows of data per second.
Data in BigQuery is automatically encrypted when at rest or in transit. BigQuery also has the ability to isolate jobs and handle security for multitenant activity. Because BigQuery is tightly integrated with other GCP products’ security features, organizations can take a holistic view of data security.
Users can share datasets using Google Cloud Identity and Access Management (IAM). Administrators can set permissions for individuals and groups to access datasets, tables, and views.
Google’s Virtual Private Cloud policy controls protect against anyone outside of your organization accessing data or attempting to export data to unauthorized third parties. IAM and VPC work across Google Cloud Platform to eliminate security gaps between products.
BigQuery pricing is based on storage and query processing volume. Storage is charged at two levels. Active storage is charged per month for data that has been modified within the last 90 days. Data not changed within 90 days is charged at a lower monthly rate.
There are also two pricing models for queries. On-demand pricing is based on the amount of data processed by each query. Flat-rate pricing charges customers for dedicated querying resources with no per-query charges. This pricing is designed for organizations that prefer a fixed cost.
Getting data into BigQuery
You need to load your data into BigQuery before you can begin using it to generate business intelligence. A data integration solution can help you automate the complex process of extracting data and loading it into BigQuery. With more than 100 connectors to popular data sources, such as Google Analytics, Google Ads, and Facebook Ads, Stitch makes the job of loading BigQuery fast and simple.
Whether you’re considering your first data warehouse or pondering a move to a new one, try Stitch for free to streamline your data loading operations.