Tutorial: How to flatten the GA4 BigQuery export schema for use in relational databases

As you might have noticed, the Google Analytics 4 export in BigQuery contains nested and repeated fields. This is fine when you know how to query them (this website provides lots of example queries), but it becomes an issue when you want to send the data to another (relational) database, like Redshift or Snowflake.
I was inspired by this solution for flattening the Universal Analytics export schema.
For those who want to import GA data into a relational database: be aware that a nested schema effectively packs multiple relational tables into one nested structure. The two representations are equivalent, so plain flattening is not the best solution here: you want to separate the tables again. That way you save a lot of storage space and your queries may well get faster.
The Google Analytics 4 schema is quite different, but we'll adapt this idea to make it work for us.
Explore the schema
Let's take a look at the schema for a recent export table:
```sql
select
  column_name,
  ordinal_position,
  data_type
from
  -- change this to your google analytics 4 export location in bigquery
  `ga4bigquery.analytics_250794857`.INFORMATION_SCHEMA.COLUMNS
where
  table_name = concat('events_', format_date('%Y%m%d', date_sub(current_date(), interval 2 day)))
```
A nested field will have the data type `STRUCT` and a repeated field will show an `ARRAY` data type. Looking at the results of our query we can conclude that three columns contain repeated fields: `event_params`, `user_properties` and `items`.
| Row | column_name | ordinal_position | data_type |
|---|---|---|---|
| 1 | event_date | 1 | STRING |
| 2 | event_timestamp | 2 | INT64 |
| 3 | event_name | 3 | STRING |
| 4 | event_params | 4 | ARRAY<STRUCT<key STRING, value STRUCT<string_value STRING, int_value INT64, float_value FLOAT64, double_value FLOAT64>>> |
| 5 | event_previous_timestamp | 5 | INT64 |
| 6 | event_value_in_usd | 6 | FLOAT64 |
| 7 | event_bundle_sequence_id | 7 | INT64 |
| 8 | event_server_timestamp_offset | 8 | INT64 |
| 9 | user_id | 9 | STRING |
| 10 | user_pseudo_id | 10 | STRING |
| 11 | user_properties | 11 | ARRAY<STRUCT<key STRING, value STRUCT<string_value STRING, int_value INT64, float_value FLOAT64, double_value FLOAT64, set_timestamp_micros INT64>>> |
| 12 | user_first_touch_timestamp | 12 | INT64 |
| 13 | user_ltv | 13 | STRUCT<revenue FLOAT64, currency STRING> |
| 14 | device | 14 | STRUCT<category STRING, mobile_brand_name STRING, mobile_model_name STRING, mobile_marketing_name STRING, mobile_os_hardware_model STRING, operating_system STRING, operating_system_version STRING, vendor_id STRING, advertising_id STRING, language STRING, is_limited_ad_tracking STRING, time_zone_offset_seconds INT64, browser STRING, browser_version STRING, web_info STRUCT<browser STRING, browser_version STRING, hostname STRING>> |
| 15 | geo | 15 | STRUCT<continent STRING, country STRING, region STRING, city STRING, sub_continent STRING, metro STRING> |
| 16 | app_info | 16 | STRUCT<id STRING, version STRING, install_store STRING, firebase_app_id STRING, install_source STRING> |
| 17 | traffic_source | 17 | STRUCT<name STRING, medium STRING, source STRING> |
| 18 | stream_id | 18 | STRING |
| 19 | platform | 19 | STRING |
| 20 | event_dimensions | 20 | STRUCT |
| 21 | ecommerce | 21 | STRUCT<total_item_quantity INT64, purchase_revenue_in_usd FLOAT64, purchase_revenue FLOAT64, refund_value_in_usd FLOAT64, refund_value FLOAT64, shipping_value_in_usd FLOAT64, shipping_value FLOAT64, tax_value_in_usd FLOAT64, tax_value FLOAT64, unique_items INT64, transaction_id STRING> |
| 22 | items | 22 | ARRAY<STRUCT<item_id STRING, item_name STRING, item_brand STRING, item_variant STRING, item_category STRING, item_category2 STRING, item_category3 STRING, item_category4 STRING, item_category5 STRING, price_in_usd FLOAT64, price FLOAT64, quantity INT64, item_revenue_in_usd FLOAT64, item_revenue FLOAT64, item_refund_in_usd FLOAT64, item_refund FLOAT64, coupon STRING, affiliation STRING, location_id STRING, item_list_id STRING, item_list_name STRING, item_list_index STRING, promotion_id STRING, promotion_name STRING, creative_name STRING, creative_slot STRING>> |
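To make the difference concrete before we start flattening: a `STRUCT` column like `device` is read with dot notation, while an `ARRAY` column like `event_params` must be flattened with `unnest()` before you can filter on its contents. A minimal sketch, using the auto-collected `page_location` parameter as an example:

```sql
-- a struct field is addressed with dot notation;
-- a repeated field has to be flattened with unnest() first
select
  device.category,
  (select value.string_value from unnest(event_params) where key = 'page_location') as page_location
from
  -- change this to your google analytics 4 export location in bigquery
  `ga4bigquery.analytics_250794857.events_*`
where
  _table_suffix = format_date('%Y%m%d', date_sub(current_date(), interval 2 day))
limit 10
```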
Define our strategy
If we want to use the GA4 export schema in a relational database, we will need four tables:
- flat_events
- flat_event_params
- flat_user_properties
- flat_items
As GA4 is an event-driven analytics tool, the `events` table is our base: it will contain all top-level data about users, events, device, traffic source, ecommerce transactions and so on.
To join in specific data from the other tables we will need some unique identifiers that are available in each table. These join keys are:
- user_pseudo_id
- event_timestamp
- event_name
- a custom dedup_id*
* This last key is an idea of Alexandre, who used the approach in this article and ran into an issue: GA4 doesn't send requests right away; instead it adds them to a queue and sends them in batches. That means that if a user performs the same action several times in a short amount of time, those events are all sent together and will have the same timestamp. As a consequence, the join key (`user_pseudo_id` + `event_name` + timestamp) is not always unique, which led to duplication issues when joining the `flat_events` and `flat_event_params` tables.
To simplify the joins we use a `concat` to merge them into one identifier: `join_key`.
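To illustrate with some made-up rows: two batched events that share all three natural keys still receive different `dedup_id` values from `row_number()`, so their join keys end up unique. A minimal sketch (the sample values are hypothetical; explicit casts added for clarity):

```sql
-- two events from the same batch share user, timestamp and name,
-- but row_number() hands each one a distinct dedup_id
with sample_events as (
  select 'abc.123' as user_pseudo_id, 1606816800000000 as event_timestamp, 'page_view' as event_name
  union all
  select 'abc.123', 1606816800000000, 'page_view'
)
select
  concat(
    user_pseudo_id,
    cast(event_timestamp as string),
    event_name,
    cast(row_number() over(partition by user_pseudo_id, event_timestamp, event_name) as string)
  ) as join_key
from sample_events
```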
Processing
In this tutorial I only provide the queries that you need to flatten the GA4 data into a relational format. You can use BigQuery's scheduled queries functionality, the BigQuery API, or a solution like dbt or Dataform to process the data at whatever frequency you need.
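For example, a scheduled query can rebuild a flat table by wrapping one of the select statements below in a DDL statement. A minimal sketch (the shortened select list is a placeholder for the full query in the next section):

```sql
-- minimal sketch of a scheduled query: rebuild a flat table on every run
create or replace table `ga4bigquery.analytics_250794857.flat_events` as
select
  event_date,
  event_timestamp,
  event_name
  -- ...the full flat_events select list from the next section goes here
from
  `ga4bigquery.analytics_250794857.events_*`
where
  _table_suffix between '20201201'
  and format_date('%Y%m%d', date_sub(current_date(), interval 1 day))
```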
Table 1: flat_events
In the base table we want all columns except `event_params`, `user_properties` and `items`, and we create the `join_key`. The only thing we need from `event_params` for every row in this table is the `ga_session_id`; this will make our life much easier later on.
```sql
select
  (select value.int_value from unnest(event_params) where key = 'ga_session_id') as ga_session_id,
  * except(event_params, user_properties, items),
  concat(user_pseudo_id, event_timestamp, event_name,
    row_number() over(partition by user_pseudo_id, event_timestamp, event_name)) as join_key
from
  -- change this to your google analytics 4 export location in bigquery
  `ga4bigquery.analytics_250794857.events_*`
where
  -- define static and/or dynamic start and end date
  _table_suffix between '20201201'
  and format_date('%Y%m%d', date_sub(current_date(), interval 1 day))
```
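Once this table is materialized, a quick sanity check is to confirm that `join_key` is indeed unique (a sketch, assuming the flat table lives next to the export, as in the final example of this tutorial):

```sql
-- sanity check: this should return zero rows if join_key is unique
select
  join_key,
  count(*) as rows_per_key
from
  `ga4bigquery.analytics_250794857.flat_events`
group by
  join_key
having
  count(*) > 1
```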
Table 2: flat_event_params
In this table we need the event parameter key and value data, the `join_key` and the `user_pseudo_id`, `event_timestamp` and `event_name`. Knowing that we want to calculate measures based on the `ga_session_id` in the future, we take this value from its specific parameter row and apply it to all event parameters of the same event in a new column.
```sql
select
  max(ga_session_id) over (partition by user_pseudo_id, event_timestamp, event_name) as ga_session_id,
  concat(user_pseudo_id, event_timestamp, event_name, dedup_id) as join_key,
  * except(ga_session_id)
from (
  select
    user_pseudo_id,
    case when event_params.key = 'ga_session_id' then event_params.value.int_value else null end as ga_session_id,
    event_timestamp,
    event_name,
    event_params.key,
    event_params.value.string_value,
    event_params.value.int_value,
    event_params.value.float_value,
    event_params.value.double_value,
    dedup_id
  from (
    select
      row_number() over(partition by user_pseudo_id, event_timestamp, event_name) as dedup_id,
      *
    from
      -- change this to your google analytics 4 export location in bigquery
      `ga4bigquery.analytics_250794857.events_*`
    where
      -- define static and/or dynamic start and end date
      _table_suffix between '20201201' and format_date('%Y%m%d', date_sub(current_date(), interval 1 day))),
    unnest(event_params) as event_params)
```
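With this table in place, pulling a single parameter out of the key/value rows is a simple filter. A sketch, using the auto-collected `page_location` parameter of `page_view` events:

```sql
-- example: get the page_location parameter for every page_view event
select
  user_pseudo_id,
  event_timestamp,
  string_value as page_location
from
  `ga4bigquery.analytics_250794857.flat_event_params`
where
  event_name = 'page_view'
  and key = 'page_location'
```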
Table 3: flat_user_properties
In this table we need the user property key and value data, the `join_key` and the `user_pseudo_id`, `event_timestamp` and `event_name`.
```sql
select
  user_pseudo_id,
  event_timestamp,
  event_name,
  user_properties.key,
  user_properties.value.string_value,
  user_properties.value.int_value,
  user_properties.value.float_value,
  user_properties.value.double_value,
  user_properties.value.set_timestamp_micros,
  concat(user_pseudo_id, event_timestamp, event_name, dedup_id) as join_key
from (
  select
    row_number() over(partition by user_pseudo_id, event_timestamp, event_name) as dedup_id,
    *
  from
    -- change this to your google analytics 4 export location in bigquery
    `ga4bigquery.analytics_250794857.events_*`
  where
    -- define static and/or dynamic start and end date
    _table_suffix between '20201201' and format_date('%Y%m%d', date_sub(current_date(), interval 1 day))),
  unnest(user_properties) as user_properties
```
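Since every event carries the user properties that were set at that moment, the `set_timestamp_micros` column lets you reduce this table to the most recent value per user and property. A sketch using BigQuery's `qualify` clause:

```sql
-- example: keep only the most recently set value of each user property per user
select
  user_pseudo_id,
  key,
  string_value
from
  `ga4bigquery.analytics_250794857.flat_user_properties`
qualify
  row_number() over(partition by user_pseudo_id, key order by set_timestamp_micros desc) = 1
```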
Table 4: flat_items
In this table we need the item data, the `join_key` and the `user_pseudo_id`, `event_timestamp` and `event_name`.
```sql
select
  user_pseudo_id,
  event_timestamp,
  event_name,
  items.item_id,
  items.item_name,
  items.item_brand,
  items.item_variant,
  items.item_category,
  items.item_category2,
  items.item_category3,
  items.item_category4,
  items.item_category5,
  items.price_in_usd,
  items.price,
  items.quantity,
  items.item_revenue_in_usd,
  items.item_revenue,
  items.item_refund_in_usd,
  items.item_refund,
  items.coupon,
  items.affiliation,
  items.location_id,
  items.item_list_id,
  items.item_list_name,
  items.item_list_index,
  items.promotion_id,
  items.promotion_name,
  items.creative_name,
  items.creative_slot,
  concat(user_pseudo_id, event_timestamp, event_name, dedup_id) as join_key
from (
  select
    row_number() over(partition by user_pseudo_id, event_timestamp, event_name) as dedup_id,
    *
  from
    -- change this to your google analytics 4 export location in bigquery
    `ga4bigquery.analytics_250794857.events_*`
  where
    -- define static and/or dynamic start and end date
    _table_suffix between '20201201' and format_date('%Y%m%d', date_sub(current_date(), interval 1 day))),
  unnest(items) as items
```
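This table is also useful on its own for product reporting. A sketch, aggregating revenue from the standard `purchase` event:

```sql
-- example: top items by revenue from purchase events
select
  item_name,
  sum(quantity) as items_sold,
  sum(item_revenue) as revenue
from
  `ga4bigquery.analytics_250794857.flat_items`
where
  event_name = 'purchase'
group by
  item_name
order by
  revenue desc
```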
Let's query!
Here is an example query leveraging the tables you just created for use in a relational database. In this case we combine data from three tables: `flat_events`, `flat_event_params` and `flat_items`:
```sql
select
  events.device.category,
  count(distinct events.user_pseudo_id) as users,
  count(distinct concat(events.user_pseudo_id, events.ga_session_id)) as sessions,
  count(distinct case
    when event_params.event_name = 'user_engagement'
      and event_params.key = 'session_engaged'
      and event_params.string_value = '1'
    then event_params.user_pseudo_id else null end) as engaged_users,
  count(distinct case
    when event_params.event_name = 'user_engagement'
      and event_params.key = 'session_engaged'
      and event_params.string_value = '1'
    then concat(events.user_pseudo_id, events.ga_session_id) else null end) as engaged_sessions,
  count(distinct case
    when items.event_name = 'add_to_cart'
    then concat(events.user_pseudo_id, events.ga_session_id) else null end) as sessions_with_add_to_cart
from
  -- change this to your google analytics 4 export location in bigquery
  `ga4bigquery.analytics_250794857.flat_events` as events
left join `ga4bigquery.analytics_250794857.flat_event_params` as event_params
  on events.join_key = event_params.join_key
left join `ga4bigquery.analytics_250794857.flat_items` as items
  on events.join_key = items.join_key
where
  -- define static and/or dynamic start and end date
  event_date between '20201201' and format_date('%Y%m%d', date_sub(current_date(), interval 1 day))
group by
  events.device.category
```
Make sure you always check your results against the GA4 user interface, especially when you are calculating measures while joining in data from multiple tables. The joins will probably increase the number of rows significantly, which also increases the risk that your calculations return strange numbers when you use `sum` or `count`.
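For example, after a join you should count distinct identifiers rather than raw rows, because every matching parameter row multiplies the event (a sketch):

```sql
-- after joining, count distinct keys instead of rows:
-- count(*) counts each event once per joined parameter row
select
  count(*) as joined_rows,                    -- inflated by the join
  count(distinct events.join_key) as events   -- true number of events
from
  `ga4bigquery.analytics_250794857.flat_events` as events
left join `ga4bigquery.analytics_250794857.flat_event_params` as event_params
  on events.join_key = event_params.join_key
```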
Now it's your turn!
I hope you've enjoyed this tutorial and feel a bit more confident about utilising your own Google Analytics data in BigQuery. Please let me know if you have any feedback or suggestions to improve the quality of the content.