Tutorial: How to flatten the GA4 BigQuery export schema for usage in relational databases

In this tutorial I will show you - step by step - how to flatten the Google Analytics 4 export schema with the purpose of using the data in a relational database outside of BigQuery.

As you might have noticed, the Google Analytics 4 export in BigQuery contains nested and repeated fields. This is fine when you know how to query them (this website provides lots of example queries), but it becomes an issue when you want to send the data to another (relational) database, like Redshift or Snowflake.

I was inspired by this solution for flattening the Universal Analytics export schema.

For those who want to import GA data to a relational database: please be aware that a nested schema effectively packs multiple relational tables into one nested structure - they are equivalent, so plain flattening is not the best solution here: you want to separate the tables again. That way you save a lot of storage space and your queries might get faster.

The Google Analytics 4 schema is quite different, but we'll adapt this idea to make it work for us.

Explore schema

Let's take a look at the schema for a recent export table:

select
    column_name,
    ordinal_position,
    data_type
from
    -- change this to your google analytics 4 export location in bigquery
    `ga4bigquery.analytics_250794857`.INFORMATION_SCHEMA.COLUMNS
where
    table_name = concat('events_',format_date('%Y%m%d',date_sub(current_date(), interval 2 day)))

A nested field will have the data type STRUCT and a repeated field will show an ARRAY data type. Looking at the results of our query we can conclude that three columns contain repeated fields: event_params, user_properties and items.

Row column_name ordinal_position data_type
1 event_date 1 STRING
2 event_timestamp 2 INT64
3 event_name 3 STRING
4 event_params 4 ARRAY<STRUCT<key STRING, value STRUCT<string_value STRING, int_value INT64, float_value FLOAT64, double_value FLOAT64>>>
5 event_previous_timestamp 5 INT64
6 event_value_in_usd 6 FLOAT64
7 event_bundle_sequence_id 7 INT64
8 event_server_timestamp_offset 8 INT64
9 user_id 9 STRING
10 user_pseudo_id 10 STRING
11 user_properties 11 ARRAY<STRUCT<key STRING, value STRUCT<string_value STRING, int_value INT64, float_value FLOAT64, double_value FLOAT64, set_timestamp_micros INT64>>>
12 user_first_touch_timestamp 12 INT64
13 user_ltv 13 STRUCT<revenue FLOAT64, currency STRING>
14 device 14 STRUCT<category STRING, mobile_brand_name STRING, mobile_model_name STRING, mobile_marketing_name STRING, mobile_os_hardware_model STRING, operating_system STRING, operating_system_version STRING, vendor_id STRING, advertising_id STRING, language STRING, is_limited_ad_tracking STRING, time_zone_offset_seconds INT64, browser STRING, browser_version STRING, web_info STRUCT<browser STRING, browser_version STRING, hostname STRING>>
15 geo 15 STRUCT<continent STRING, country STRING, region STRING, city STRING, sub_continent STRING, metro STRING>
16 app_info 16 STRUCT<id STRING, version STRING, install_store STRING, firebase_app_id STRING, install_source STRING>
17 traffic_source 17 STRUCT<name STRING, medium STRING, source STRING>
18 stream_id 18 STRING
19 platform 19 STRING
20 event_dimensions 20 STRUCT
21 ecommerce 21 STRUCT<total_item_quantity INT64, purchase_revenue_in_usd FLOAT64, purchase_revenue FLOAT64, refund_value_in_usd FLOAT64, refund_value FLOAT64, shipping_value_in_usd FLOAT64, shipping_value FLOAT64, tax_value_in_usd FLOAT64, tax_value FLOAT64, unique_items INT64, transaction_id STRING>
22 items 22 ARRAY<STRUCT<item_id STRING, item_name STRING, item_brand STRING, item_variant STRING, item_category STRING, item_category2 STRING, item_category3 STRING, item_category4 STRING, item_category5 STRING, price_in_usd FLOAT64, price FLOAT64, quantity INT64, item_revenue_in_usd FLOAT64, item_revenue FLOAT64, item_refund_in_usd FLOAT64, item_refund FLOAT64, coupon STRING, affiliation STRING, location_id STRING, item_list_id STRING, item_list_name STRING, item_list_index STRING, promotion_id STRING, promotion_name STRING, creative_name STRING, creative_slot STRING>>

Define our strategy

If we want to use the GA4 export schema in a relational database, we will need four tables:

  1. flat_events
  2. flat_event_params
  3. flat_user_properties
  4. flat_items
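
As a sketch of the target layout: the table and column names below follow the queries later in this tutorial, but the DDL itself is a Redshift/Snowflake-style assumption that you should adapt to your destination database:

-- hypothetical destination DDL; all four flat_* tables share the join_key column
create table flat_events (
    join_key        varchar,   -- user_pseudo_id + event_timestamp + event_name + dedup_id
    ga_session_id   bigint,
    event_date      varchar,
    event_timestamp bigint,
    event_name      varchar,
    user_pseudo_id  varchar
    -- plus the remaining top-level GA4 columns
);

create table flat_event_params (
    join_key      varchar,
    ga_session_id bigint,
    key           varchar,
    string_value  varchar,
    int_value     bigint,
    float_value   double precision,
    double_value  double precision
    -- plus user_pseudo_id, event_timestamp and event_name
);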

As GA4 is an event-driven analytics tool, the events table is our base: it will contain all top-level data about users, events, device, traffic source, ecommerce transactions and so on.

To join in specific data from the other tables we will need some unique identifiers that are available in each table. These join keys are:

  • user_pseudo_id
  • event_timestamp
  • event_name
  • a custom dedup_id*

* This last key was suggested by Alexandre, who used the approach in this article and ran into an issue:

GA4 doesn't send requests right away; rather, it adds them to a queue and sends them in batches. That means that if a user performs the same action several times in a short amount of time, those events are all sent together and will have the same timestamp. As a consequence, the join key (user_pseudo_id + event_name + timestamp) is not always unique, which led to duplication issues when joining the flat_events and flat_event_params tables.

To simplify the joins we concatenate these four values into one identifier: join_key.
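
To make the batching problem concrete, here is a minimal sketch with two fabricated add_to_cart events that share the same user, name and timestamp; row_number() hands each copy its own dedup_id, so the resulting join_keys differ (note that event_timestamp and row_number() are INT64 and need to be cast to STRING before concatenation):

-- sketch with fabricated values: dedup_id disambiguates events that share
-- the same user_pseudo_id, event_name and event_timestamp
with batched_events as (
    select 'abc123' as user_pseudo_id, 1606816597439000 as event_timestamp, 'add_to_cart' as event_name
    union all
    select 'abc123', 1606816597439000, 'add_to_cart'
)
select
    concat(
        user_pseudo_id,
        cast(event_timestamp as string),
        event_name,
        cast(row_number() over(partition by user_pseudo_id, event_timestamp, event_name) as string)
    ) as join_key
from
    batched_events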

Processing

In this tutorial I only provide the queries that you will need to flatten the GA4 data into a different (relational) format. You can use BigQuery's scheduled queries functionality, the BigQuery API, or a solution like dbt or Dataform to process the data at the frequency you need.
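
For example, with scheduled queries each table can be materialised by wrapping its query in a create or replace table statement. The target dataset `ga4bigquery.flat` below is an assumption, not part of the original setup - use your own:

-- sketch: rebuild the base table once a day via a BigQuery scheduled query
-- (simplified: the full flat_events query below also adds ga_session_id and join_key)
create or replace table `ga4bigquery.flat.flat_events` as
select
    * except(event_params, user_properties, items)
from
    -- change this to your google analytics 4 export location in bigquery
    `ga4bigquery.analytics_250794857.events_*`
where
    _table_suffix = format_date('%Y%m%d', date_sub(current_date(), interval 1 day));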

Table 1: flat_events

In the base table we want all columns except event_params, user_properties and items, and we create the join_key. The only thing we need from event_params for every row in this table is ga_session_id; this will make our life much easier later on.

select
    (select value.int_value from unnest(event_params) where key = 'ga_session_id') as ga_session_id,
    * except(event_params,user_properties,items),
    concat(user_pseudo_id,cast(event_timestamp as string),event_name,cast(row_number() over(partition by user_pseudo_id, event_timestamp, event_name) as string)) as join_key
from
    -- change this to your google analytics 4 export location in bigquery
    `ga4bigquery.analytics_250794857.events_*`
where
    -- define static and/or dynamic start and end date
    _table_suffix between '20201201'
    and format_date('%Y%m%d',date_sub(current_date(), interval 1 day))
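
The joins later on rely on join_key being unique in this base table, so it's worth verifying. This sketch assumes you saved the result to a table `ga4bigquery.flat.flat_events` (a hypothetical location); it should return zero rows:

-- any row returned here means join_key is not unique
select
    join_key,
    count(*) as occurrences
from
    `ga4bigquery.flat.flat_events`
group by
    join_key
having
    count(*) > 1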

Table 2: flat_event_params

In this table we need the event parameter key and value data, the join_key, and the user_pseudo_id, event_timestamp and event_name. Knowing that we want to calculate measures based on the ga_session_id in the future, we take this value from the specific parameter and apply it to all event parameters in a new column.

select
    max(ga_session_id) over (partition by user_pseudo_id,event_timestamp,event_name) as ga_session_id,
    concat(user_pseudo_id,cast(event_timestamp as string),event_name,cast(dedup_id as string)) as join_key,
    * except(ga_session_id)
from (
    select
        user_pseudo_id,
        case when event_params.key = 'ga_session_id' then event_params.value.int_value else null end as ga_session_id,
        event_timestamp,
        event_name,
        event_params.key,
        event_params.value.string_value,
        event_params.value.int_value,
        event_params.value.float_value,
        event_params.value.double_value,
        dedup_id
    from (
        select
            row_number() over(partition by user_pseudo_id, event_timestamp, event_name) as dedup_id,
            *
        from
            -- change this to your google analytics 4 export location in bigquery
            `ga4bigquery.analytics_250794857.events_*`
        where
            -- define static and/or dynamic start and end date
            _table_suffix between '20201201' and format_date('%Y%m%d',date_sub(current_date(), interval 1 day))),
        unnest(event_params) as event_params)
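
With the parameters flattened into rows, pulling a single parameter back out is a simple filter. For example, to get page_location per event - assuming the query above was saved to a table `ga4bigquery.flat.flat_event_params` (a hypothetical location):

-- one row per event that carries a page_location parameter
select
    join_key,
    string_value as page_location
from
    `ga4bigquery.flat.flat_event_params`
where
    key = 'page_location'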

Table 3: flat_user_properties

In this table we need the user properties key and values data, the join_key and the user_pseudo_id, event_timestamp and event_name.

select 
    user_pseudo_id,
    event_timestamp,
    event_name,
    user_properties.key,
    user_properties.value.string_value,
    user_properties.value.int_value,
    user_properties.value.float_value,
    user_properties.value.double_value,
    user_properties.value.set_timestamp_micros,
    concat(user_pseudo_id,cast(event_timestamp as string),event_name,cast(dedup_id as string)) as join_key
from (
    select 
        row_number() over(partition by user_pseudo_id, event_timestamp, event_name) as dedup_id,
        *
    from
        -- change this to your google analytics 4 export location in bigquery
        `ga4bigquery.analytics_250794857.events_*`
    where
    -- define static and/or dynamic start and end date
    _table_suffix between '20201201' and format_date('%Y%m%d',date_sub(current_date(), interval 1 day))),
    unnest(user_properties) as user_properties

Table 4: flat_items

In this table we need the item data, the join_key and the user_pseudo_id, event_timestamp and event_name.

select
    user_pseudo_id,
    event_timestamp,
    event_name,
    items.item_id,
    items.item_name,
    items.item_brand,
    items.item_variant,
    items.item_category,
    items.item_category2,
    items.item_category3,
    items.item_category4,
    items.item_category5,
    items.price_in_usd,
    items.price,
    items.quantity,
    items.item_revenue_in_usd,
    items.item_revenue,
    items.item_refund_in_usd,
    items.item_refund,
    items.coupon,
    items.affiliation,
    items.location_id,
    items.item_list_id,
    items.item_list_name,
    items.item_list_index,
    items.promotion_id,
    items.promotion_name,
    items.creative_name,
    items.creative_slot,
    concat(user_pseudo_id,cast(event_timestamp as string),event_name,cast(dedup_id as string)) as join_key
from (
    select
        row_number() over(partition by user_pseudo_id, event_timestamp, event_name) as dedup_id,
        *
    from 
    -- change this to your google analytics 4 export location in bigquery
    `ga4bigquery.analytics_250794857.events_*`
    where
    -- define static and/or dynamic start and end date
    _table_suffix between '20201201' and format_date('%Y%m%d',date_sub(current_date(), interval 1 day))),
    unnest(items) as items

Let's query!

Here's an example query that leverages the tables you just created for use in a relational database. In this case we combine data from three tables: flat_events, flat_event_params and flat_items:

select
    events.device.category,
    count(distinct events.user_pseudo_id) as users,
    count(distinct concat(events.user_pseudo_id,cast(events.ga_session_id as string))) as sessions,
    count(distinct case when event_params.event_name = 'user_engagement' and event_params.key = 'session_engaged' and event_params.string_value = '1' then event_params.user_pseudo_id else null end) as engaged_users,
    count(distinct case when event_params.event_name = 'user_engagement' and event_params.key = 'session_engaged' and event_params.string_value = '1' then concat(events.user_pseudo_id,cast(events.ga_session_id as string)) else null end) as engaged_sessions,
    count(distinct case when items.event_name = 'add_to_cart' then concat(events.user_pseudo_id,cast(events.ga_session_id as string)) else null end) as sessions_with_add_to_cart
from
    -- change this to your google analytics 4 export location in bigquery
    `ga4bigquery.analytics_250794857.flat_events` as events
    left join `ga4bigquery.analytics_250794857.flat_event_params` as event_params on events.join_key = event_params.join_key
    left join `ga4bigquery.analytics_250794857.flat_items` as items on events.join_key = items.join_key
where
    -- define static and/or dynamic start and end date
    event_date between '20201201' and format_date('%Y%m%d',date_sub(current_date(), interval 1 day))
group by
    device.category

Make sure you always check your results against the GA4 user interface, especially when you are calculating measures while joining in data from multiple tables. Joining this way will probably increase the number of rows significantly, which also increases the risk that your calculations return strange numbers when you use sum or count.
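
A quick sanity check along those lines is to compare distinct users between the raw export and your flat base table for the same date range - again assuming the base table was saved to `ga4bigquery.flat.flat_events` (a hypothetical location):

-- both numbers should match; a mismatch points at a processing gap
select
    (select count(distinct user_pseudo_id)
     from `ga4bigquery.analytics_250794857.events_*`
     where _table_suffix between '20201201'
       and format_date('%Y%m%d', date_sub(current_date(), interval 1 day))) as users_raw,
    (select count(distinct user_pseudo_id)
     from `ga4bigquery.flat.flat_events`) as users_flat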

Now it's your turn!

I hope you've enjoyed this tutorial and feel a bit more confident to utilise your own Google Analytics data in BigQuery. Please let me know if you have any feedback or suggestions to improve the quality of the content.

