Google BigQuery supports nested records within tables, whether it’s a single record or repeated values.

Unlike the conventional method to denormalization, in Google BigQuery records are expressed using nested and repeated fields.

Instead of flattening attributes into a table, this approach localizes a record’s subattributes into a single table. Maintaining nested records removes the need for repeating data, creating additional subtables, or using joins during analysis.

For example: Below is a record from a table named people. In this table, each person can only have a single type, but they might have multiple friends:

{
   "id":1,
   "name":"Finn",
   "details":{
      "type":"human",
      "has_magic":false
   },
   "friends":[
      {
         "id":2,
         "name":"Jake"
      },
      {
         "id":3,
         "name":"Bubblegum"
      },
      {
         "id":4,
         "name":"BMO"
      }
   ]
}

In this guide, we’ll cover how this data will be loaded into Google BigQuery, including:


Storing nested maps (JSON objects)

A nested map is also called an object in JSON. An object is surrounded by curly braces ({ }) and contains a series of key/value pairs. Keys are strings enclosed in double quotes (" ").

For example: This record contains a details object, which contains type and has_magic keys:

{
   "id":1,
   "name":"Finn",
   "details":{
      "type":"human",
      "has_magic":false
   }
}

When records containing objects are loaded into Google BigQuery, the object is loaded using the RECORD type and a mode of NULLABLE.

For example: The above record would create this table schema in Google BigQuery:

Field name Type Mode
id INTEGER NULLABLE
name STRING NULLABLE
details RECORD NULLABLE
details.type STRING NULLABLE
details.has_magic BOOLEAN NULLABLE

And the data in the table would be similar to the following:

id name details.type details.has_magic
1 Finn human false

To query nested data using the standard SQL syntax, you can use dot notation to indicate the field(s) you want to reference. For example: The sample query below will return the id, name, and details.type fields:

SELECT id,
       name,
       details.type
  FROM people

+----+------+--------------+
| id | name | details.type |
+----+------+--------------+
| 1  | Finn | human        |
+----+------+--------------+

Storing nested records (JSON arrays)

A nested record is also called an array in JSON. An array is surrounded by square brackets ([ ]) and contains an ordered list of values. Values can be strings, numbers, booleans, objects, nulls, or more arrays.

When records containing arrays are loaded into Google BigQuery, the array is loaded using the RECORD type and a mode of REPEATED. By using the REPEATED mode to store nested records, Stitch can avoid repeating data or creating additional subtables. This functionality removes the need for joins when analyzing data, making raw data easier to read and faster to compute.

For items in the array, Stitch will handle each item like an object field. Items will be loaded using the RECORD type, a mode of NULLABLE, and a field name of value.

How the array is structured determines how the data it contains will be loaded into Google BigQuery. In this section, we’ll cover some examples and demonstrate how the source data for each one would be loaded into Google BigQuery:

Refer to Google’s documentation for more info on querying nested records.

Array of strings or numbers

In this example, the record contains two arrays: friends, an array of strings, and friend_ids, an array of integers:

{
   "id":1,
   "name":"Finn",
   "friends":["Jake","Bubblegum","BMO"],
   "friend_ids":[2, 3, 4]
}

The above record would create this table schema in Google BigQuery:

Field name Type Mode
id INTEGER NULLABLE
name STRING NULLABLE
friends RECORD REPEATED
friends.value STRING NULLABLE
friend_ids RECORD REPEATED
friend_ids.value INTEGER NULLABLE

And the data in the table would be similar to the following:

id name friends.value friend_ids.value
1 Finn Jake 2
Bubblegum 3
BMO 4

Array of objects

In this example, the record contains a single array named friends, which contains a series of objects:

{
   "id":1,
   "name":"Finn",
   "friends":[
      {
         "id":2,
         "name":"Jake"
      },
      {
         "id":3,
         "name":"Bubblegum"
      },
      {
         "id":4,
         "name":"BMO"
      }
   ]
}

The above record would create this table schema in Google BigQuery:

Field name Type Mode
id INTEGER NULLABLE
name STRING NULLABLE
friends RECORD REPEATED
friends.value RECORD NULLABLE
friends.value.id INTEGER NULLABLE
friends.value.name STRING NULLABLE

And the data in the table would be similar to the following:

id name friends.value.id friends.value.name
1 Finn 2 Jake
3 Bubblegum
4 BMO

Array of arrays

In this example, the record contains an array (friend_ids) which contains a series of arrays:

{
   "id":1,
   "name":"Finn",
   "friend_ids":[
      [2,3],
      [4,5]
   ]
}

The above record would create this table schema in Google BigQuery:

Field name Type Mode
id INTEGER NULLABLE
name STRING NULLABLE
friend_ids RECORD REPEATED
friend_ids.value RECORD REPEATED
friend_ids.value.value INTEGER NULLABLE

And the data in the table would be similar to the following:

id name friend_ids.value.value
1 Finn 2
3
4
5

Array containing multiple data types

In this example, the record contains a single array named friend_ids. Notice that the first two values in the array are strings (ex: "2" versus 2), and the last value is an integer (ex: 4 versus "4"):

{
   "id":1,
   "name":"Finn",
   "friend_ids":["2", "3", 4]
}

To accommodate the multiple data types, Stitch will create additional value columns, one for each data type, and append a data type suffix to the name of each additional column.

In this example, the friend_ids.value column will store all INTEGER data, and Stitch will create an additional friend_ids.value__st column to store all STRING data:

Field name Type Mode
id INTEGER NULLABLE
name STRING NULLABLE
friend_ids RECORD REPEATED
friend_ids.value INTEGER NULLABLE
friend_ids.value__st STRING NULLABLE

And the data in the table would be similar to the following:

id name friend_ids.value friend_ids.value__st
1 Finn 2
3
4

Array of nested arrays

In this example, the record contains an array (friend_ids) which contains a series of nested arrays:

{
   "id":1,
   "name":"Finn",
   "friend_ids":[
      [
        [2,3]
      ],
      [
        [4,5]
      ]
   ]
}

The above record would create this table schema in Google BigQuery:

Field name Type Mode
id INTEGER NULLABLE
name STRING NULLABLE
friend_ids RECORD REPEATED
friend_ids.value RECORD REPEATED
friend_ids.value.value RECORD REPEATED
friend_ids.value.value.value INTEGER NULLABLE

And the data in the table would be similar to the following:

id name friend_ids.value.value.value
1 Finn 2
3
4
5

Resources