This article is an excerpt from the original piece I posted on Medium in 2019, but still relevant for beginners.
For those of you wondering why you should use BigQuery to analyze Google Analytics data anyway, read this excellent piece. Some big advantages:
- No more sampling. Ever.
- Unlimited amount of dimensions
- Combining different scopes in one report (not for the faint of heart!)
- Calculate goal completions, build your own Channel Grouping and correct data errors, all on past data
- Combine Google Analytics data with third party data sources
Steep learning curve
But let’s not become too excited. Truth is that diving into BigQuery can be quite frustrating, once you figure out a lot of the Google Analytics metrics you are used to are nowhere to be found.
What makes BigQuery interesting for Google Analytics users, specifically Premium customers, is that Google can dump raw Google Analytics data into BigQuery daily. While this enables many types of analysis that can’t be performed within the Google Analytics interface, it also doesn’t provide any basic metrics, e.g. bounce rate, to use. (source)
There are two sides to this: the tough part is that I had to calculate every ‘missing’ Google Analytics metric in my queries. The positive effect: my understanding of the metrics on a conceptual level improved considerably.
The BigQuery cookbook helped me out in some cases, but also seemed incomplete and outdated at times. Since Standard SQL syntax is the preferred BigQuery language nowadays and a lot of old Stackoverflow entries are using the Legacy SQL syntax, I spent hours and hours to get my head around the SQL queries I had to write to get the reports I wanted. Apart from the calculated metrics that I needed to take care of, there was another hurdle to cross: nested and repeated fields.
Each row in the Google Analytics BigQuery dump represents a single session and contains many fields, some of which can be repeated and nested, such as the hits, which contains a repeated set of fields within it representing the page views and events during the session, and custom dimensions, which is a single, repeated field . This is one of the main differences between BigQuery and a normal database. (source)
Try this at home
No Google Cloud Billing account? Enter the BigQuery Sandbox, which allows you to use the BigQuery web UI without enabling a billing account. To set up a Google Analytics to BigQuery export you need Google Analytics 360 (part of the Google Marketing Platform).
I assume you have a basic understanding of SQL as a querying language and BigQuery as a database tool. If not, I suggest you follow a SQL introduction course first, as I will not go into details about the SQL syntax, but will focus on how to get your (custom) Google Analytics reports out of BigQuery for analysing purposes. All query examples are in Standard SQL.
However, I recommend you use your own Google Analytics dataset if you want to compare the results of your queries with Google Analytics, because I’ve noticed differences between the Google Merchandise Store data in Google Analytics and the sample BigQuery dataset. I tested the queries on other Google Analytics-accounts and they matched quite well.
To get a good understanding of the
ga_sessions_ table in BigQuery, let’s take a look at the BigQuery Export schema, which gives us an idea of the available raw Google Analytics data fields in BigQuery.
Although you probably will recognize a lot of dimensions and metrics from the Google Analytics UI, I know this schema can be a bit overwhelming. To get a better understanding of our data set, we have to know the structure of the (nested) fields. The next picture represents two rows (= 2 sessions) from a
As you can see our trouble starts if you need custom dimensions, custom metrics or any data on hit-level: i.e. events, pageviews or product data. Let’s query our nested sample set:
select * from `bigquery-public-data.google_analytics_sample.ga_sessions_20160801` limit 2
This gives us 2 rows, which represented as a flat table would look like this:
Remember, only row 2 and 14 in this example are real rows in our table. The other ‘rows’ are in fact nested fields, in most cases
null values. Only the
hits.product columns are populated with values.
To deal with this fields and to be able to query our tables so they meet our needs, we need the
The problem here is that is essentially an array (actually in BigQuery parlance it’s a “repeated record”, but you can think of it as an array). (…) This is where the
unnestfunction comes in. It basically lets you take elements in an array and expand each one of these individual elements. You can then join your original row against each unnested element to add them to your table. (source)
I highly recommend reading this article which explains the
unnest concept in detail with the Firebase Analytics sample data set as an example.
You only have to
unnest records that contain ‘repeated fields’. In case of our Google Analytics data set these could involve:
To make sure you understand the structure of the BigQuery export schema, I encourage you to take look at this interactive visual representation.
Now it's your turn!
I hope you've enjoyed this introduction and feel a bit more confident to utilise your own Google Universal Analytics 360 data in BigQuery. Check out the other tutorials, learn how to query dimensions & metrics and how to replicate GA reports. Please let me know if you have any feedback or suggestions to improve the quality of the content.