© 2017 by Doran Bae 

You are reading Coding Otter, stuffed with articles about big data, data science development, ML in production, managing data science projects, and many more. ​
About the host
I'm Doran Bae, Data Scientist @TVB turning data into products and stories. More about me.

A complete guide to flattening data in Google BigQuery

We use Google BigQuery(BQ, here and thereafter) at work. If you have used SQL before, using BQ feels like a walk in the park until you encounter nested and repeated data. In this blogpost, I will share what I have learned from using BQ in regards to these type of schema.

Oh dear.

Why use nested and repeated data structure?

I am not an expert in database management, so nested schema is just a bad news when working within BQ, but here's what others say:

  • This allows BQ to store complex data structures and relationships between many types of records, but doing so all within one single table

  • It is usually used with repeated items such as orders, order_items, or sessions, and page_loads

  • According to the data engineers at my company, it supposedly saves lots of data space (because you don't need to repeat rows)

I don't know how to transform complex data into the nested format (I dare not to find out any time soon), so I will focus on what to do when I see one. For the following sections, I will use Standard SQL, otherwise specified to be Legacy SQL.

Use cases

1. When I want to query one of the nested columns

The first thing you need to do is to tell BQ to unnest. Standard SQL syntax looks something like this:

SELECT  some_col_name,  
FROM  table_name  
CROSS JOIN UNNEST(table_name.nested_records) AS alias

What the above query is saying is to break up the nested_records into its individual members. Then join each of these members with a clone of the original row.

UNNEST is an operator which "flattens" the ARRAY into a set of rows. UNNEST takes an ARRAY and returns a table with single row for each element in the ARRAY. UNNEST destroys the order of the ARRAY element, so you may need to restore order to the table.

To do so, use the option WITH OFFSET clause to return an additional column with the offset for each array element, then use the ORDER BY clause to order the rows by their offset.

SELECT  some_col_name,  
FROM  table_name  
CROSS JOIN UNNEST(table_name.nested_records) AS alias
ORDER BY offset

There are many ways to do the UNNEST other than using some kind of JOINs. I have found that the following syntax also works

FROM  table_name  
LEFT JOIN UNNEST(table_name.nested_records) AS alias
-- OR
FROM  table_name,UNNEST(nested_records) AS alias

As far as how they are different from each other, I don't think it matters much in simple queries.

2. When I want to query multiple items in one nested records

I personally had the most difficult times managing this concept - more so than querying from multiple (different) nested records (refer to case 3). The trick is this. First, I need to UNNEST the repeated records first; hence, 'UNNEST(nested_records) AS alias'. This operation alone will allow you to query one element in the nested_records. Actually, this is equivalent to case 1 above. Second, I need to add an additional UNNEST to get another item in the nested_records such as below.

SELECT  some_col_name,  
        one_col_name_of_nested_records  alias_2
FROM  table_name,  
UNNEST(nested_records) AS alias_1  UNNEST(two_col_name_of_nested_records) AS alias_2

3. When I want to flatten multiple nested records

I hadn't encountered this too many times, but I think it is worthwhile to know this one as well.

SELECT  alias_1,  
FROM  table_name,  
UNNEST(nested_records_A) AS alias_1  
UNNEST(nested_records_B) AS alias_2

This is it for today! May be I will come back with more complex UNNEST problems next time!


Working with Arrays in Standard SQL (Google Cloud Official Document)

How to FLATTEN Data Using Google BigQuery's Legacy vs Standard SQL

Why Nesting Is So Cool

A Deep Dive into Google BigQuery Architecture

BigQuery Tip: The UNNEST Function

stack overflow: BigQuery: flatten repeated columns

stack overflow: BigQuery: Flattening all repeated fields in nested schema