We use Google BigQuery(BQ, here and thereafter) at work. If you have used SQL before, using BQ feels like a walk in the park until you encounter nested and repeated data. In this blogpost, I will share what I have learned from using BQ in regards to these type of schema.
Why use nested and repeated data structure?
I am not an expert in database management, so nested schema is just a bad news when working within BQ, but here's what others say:
This allows BQ to store complex data structures and relationships between many types of records, but doing so all within one single table
It is usually used with repeated items such as orders, order_items, or sessions, and page_loads
According to the data engineers at my company, it supposedly saves lots of data space (because you don't need to repeat rows)
I don't know how to transform complex data into the nested format (I dare not to find out any time soon), so I will focus on what to do when I see one. For the following sections, I will use Standard SQL, otherwise specified to be Legacy SQL.
1. When I want to query one of the nested columns
The first thing you need to do is to tell BQ to unnest. Standard SQL syntax looks something like this:
SELECT some_col_name, alias.one_col_name_of_nested_records FROM table_name CROSS JOIN UNNEST(table_name.nested_records) AS alias
What the above query is saying is to break up the nested_records into its individual members. Then join each of these members with a clone of the original row.
UNNEST is an operator which "flattens" the ARRAY into a set of rows. UNNEST takes an ARRAY and returns a table with single row for each element in the ARRAY. UNNEST destroys the order of the ARRAY element, so you may need to restore order to the table.
To do so, use the option WITH OFFSET clause to return an additional column with the offset for each array element, then use the ORDER BY clause to order the rows by their offset.
SELECT some_col_name, alias.one_col_name_of_nested_records FROM table_name CROSS JOIN UNNEST(table_name.nested_records) AS alias WITH OFFSET AS offset ORDER BY offset
There are many ways to do the UNNEST other than using some kind of JOINs. I have found that the following syntax also works
FROM table_name LEFT JOIN UNNEST(table_name.nested_records) AS alias -- OR FROM table_name,UNNEST(nested_records) AS alias
As far as how they are different from each other, I don't think it matters much in simple queries.
2. When I want to query multiple items in one nested records
I personally had the most difficult times managing this concept - more so than querying from multiple (different) nested records (refer to case 3). The trick is this. First, I need to UNNEST the repeated records first; hence, 'UNNEST(nested_records) AS alias'. This operation alone will allow you to query one element in the nested_records. Actually, this is equivalent to case 1 above. Second, I need to add an additional UNNEST to get another item in the nested_records such as below.
SELECT some_col_name, one_col_name_of_nested_records alias_2 FROM table_name, UNNEST(nested_records) AS alias_1 UNNEST(two_col_name_of_nested_records) AS alias_2
3. When I want to flatten multiple nested records
I hadn't encountered this too many times, but I think it is worthwhile to know this one as well.
SELECT alias_1, alias_2 FROM table_name, UNNEST(nested_records_A) AS alias_1 UNNEST(nested_records_B) AS alias_2
This is it for today! May be I will come back with more complex UNNEST problems next time!