Skip to main content

GA4 | tips & tricks

Using the STRING_AGG function: Analyzing Full User Paths Before Conversion

This article shows a meaningful use of the STRING_AGG function on GA4 export data, which allows concatenating a previously undefined number of elements into a character separated single value, thereby visualizing customer journeys.

Editor's note: this article is the second guest post originating from a LinkedIn article by Enio Kurtesi. Similarly to the first one, the entire article is re-published, with minor additions from me. This time it is categorized under GA4 | tip & tricks as (in my view) it is best viewed as a good example of using a particular function (STRING_AGG), rather than a comprehensive analysis piece. You can check out Enio's LinkedIn profile here.

After landing on a specific page, what paths do users most commonly take through my website before they convert?" The answer to this question can reshape your understanding of user intent, reveal unexpected behavioral patterns, and ultimately guide strategic decisions about site architecture, content placement, and conversion optimization.

In this article, I'll walk you through an advanced BigQuery approach to mapping complete user journeys from specific landing pages using GA4 data. This technique goes beyond simple next-page analysis, allowing you to visualize entire paths from entry point to exit, quantify their frequency, and understand the complexity of different journey variations.

Why Does Journey Analysis Matter?

Traditional web analytics often focuses on isolated metrics like pageviews, sessions, and tracked events. While these metrics provide valuable snapshots of performance, they fail to capture the sequential narrative of the user experience. Journey analysis fills this gap by illustrating the actual paths users take through your website, showing how they engage with different pages and features on their way to completing key actions, giving you a deeper, more user-centered view of the experience.

This perspective shift from isolated metrics to connected journeys offers several key benefits:

First, it reveals unexpected behavioral patterns that might contradict your assumptions about how users interact with your site. Perhaps users are taking longer routes to conversion than you anticipated, or maybe they're discovering your key offerings through unexpected entry points. These insights can highlight opportunities for streamlining the user experience.

Second, journey analysis helps identify drop-off points where users abandon their intended paths. By pinpointing exactly where users divert from the desired journey, you can focus optimization efforts on the most problematic areas of your site.

Third, understanding common paths allows you to assess the effectiveness of your site architecture and navigation. If users consistently take indirect routes to important destinations, it might indicate that your intuitive site organization doesn't align with user expectations.

Fourth, this approach enables more sophisticated segmentation. Different user segments often follow distinctive paths through your site. Identifying these segment-specific journeys allows for more targeted personalization and marketing strategies.

The Technical Approach

Our goal is to analyze complete user journeys that begin at a specific landing page. We want to identify the most common paths, understand their complexity, and quantify how frequently each path occurs.

WITH base AS (
  SELECT
    user_pseudo_id,
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS session_id,
    (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS page_location,
    event_timestamp
  FROM
    `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_2021*`
  WHERE
    event_name = 'page_view'
),
sessions_with_purchase AS (
  SELECT
    user_pseudo_id,
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS session_id
  FROM
    `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_2021*`
  WHERE
    event_name = 'purchase'
  GROUP BY
    user_pseudo_id,
    session_id
),
sessions_with_landing AS (
  SELECT
    user_pseudo_id,
    session_id,
    MIN(event_timestamp) AS landing_timestamp
  FROM
    base
  WHERE
    page_location = 'https://shop.googlemerchandisestore.com/'  -- Replace with your actual landing page URL
  GROUP BY
    user_pseudo_id,
    session_id
),
ordered_pageviews AS (
  SELECT
    b.user_pseudo_id,
    b.session_id,
    b.page_location,
    b.event_timestamp,
    ROW_NUMBER() OVER (PARTITION BY b.user_pseudo_id, b.session_id ORDER BY b.event_timestamp) AS step_number
  FROM
    base b
  JOIN
    sessions_with_landing s
  ON
    b.user_pseudo_id = s.user_pseudo_id
    AND b.session_id = s.session_id
  JOIN
    sessions_with_purchase p
  ON
    b.user_pseudo_id = p.user_pseudo_id
    AND b.session_id = p.session_id
  WHERE
    b.event_timestamp >= s.landing_timestamp
),
path_strings AS (
  SELECT
    user_pseudo_id,
    session_id,
    STRING_AGG(CONCAT(CAST(step_number AS STRING), '. ', page_location), ' > ' ORDER BY step_number) AS full_path,
    COUNT(*) AS page_count
  FROM
    ordered_pageviews
  GROUP BY
    user_pseudo_id,
    session_id
)
SELECT
  full_path,  COUNT(*) AS path_count,
  MAX(page_count) AS pages_in_path
FROM
  path_strings
  
GROUP BY
  full_path
  having pages_in_path > 1
ORDER BY
  path_count DESC 

Let's break down our approach by examining each component of the query.

Step 1: Extracting Base Pageview Data

WITH base AS (
  SELECT
    user_pseudo_id,
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS session_id,
    (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS page_location,
    event_timestamp
  FROM
    `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_2021*`
  WHERE
    event_name = 'page_view'
)

In this initial step, we're extracting four essential pieces of information from each page_view event: the anonymized user identifier (user_pseudo_id), the session identifier (session_id), the page URL (page_location), and the timestamp of the event (event_timestamp).

The syntax for extracting session_id and page_location is necessary because GA4 stores these values within a nested array structure called event_params. By using the UNNEST function combined with a subquery, we can efficiently extract these values from their nested location.

Step 2: Identifying Sessions With Purchase Event

sessions_with_purchase AS (
  SELECT
    user_pseudo_id,
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS session_id
  FROM
    `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_2021*`
  WHERE
    event_name = 'purchase'
  GROUP BY
    user_pseudo_id,
    session_id
)

This CTE identifies all sessions that resulted in a purchase event.

Step 3: Identifying Sessions with Our Target Landing Page

sessions_with_landing AS (
  SELECT
    user_pseudo_id,
    session_id,
    MIN(event_timestamp) AS landing_timestamp
  FROM
    base
  WHERE
    page_location = 'https://shop.googlemerchandisestore.com/'  -- Replace with your actual landing page URL
  GROUP BY
    user_pseudo_id,
    session_id
)

This CTE filters our base dataset to include only sessions where the specified landing page was viewed. The MIN(event_timestamp) function identifies the earliest timestamp when this page was viewed within each session. This ensures that we're capturing true landing page visits rather than cases where users returned to this page later in their journey.

By grouping by both user_pseudo_id and session_id, we're creating a unique identifier for each session.

Editor's note: in this step you can get creative with the WHERE clause to narrow down the analysis to what is particularly interesting. For example, a specific landing page can be selected as a starting point (as per the example above), or excluding certain pages (e.g. where the whole session starts at checkout, which is most probably "noise" or a tracking error). The WHERE clause can also be removed to reveal all the different journey combinations.

Step 4: Sequencing Pageviews Within Conversion Sessions

ordered_pageviews AS (
  SELECT
    b.user_pseudo_id,
    b.session_id,
    b.page_location,
    b.event_timestamp,
    ROW_NUMBER() OVER (PARTITION BY b.user_pseudo_id, b.session_id ORDER BY b.event_timestamp) AS step_number
  FROM
    base b
  JOIN
    sessions_with_landing s
  ON
    b.user_pseudo_id = s.user_pseudo_id
    AND b.session_id = s.session_id
  JOIN
    sessions_with_purchase p
  ON
    b.user_pseudo_id = p.user_pseudo_id
    AND b.session_id = p.session_id
  WHERE
    b.event_timestamp >= s.landing_timestamp
),

The first JOIN with "sessions_with_landing" ensures we only analyze sessions that included our target landing page. The second JOIN with "sessions_with_purchase" ensures we only look at sessions that resulted in actual purchases.

The ROW_NUMBER() window function then sequences these pageviews chronologically within each session, creating the step-by-step journey progression that forms the foundation of our path analysis.

💡
The difference between the ROW_NUMBER() and the RANK() function (and the DENSE_RANK function) is that ROW_NUMBER() increments with every row, even if there is a tie on the ordering column. In our case, this is what we are looking for, especially because the event_timestamp will be exactly the same for events sent in the same batch.
Editor's note: if you want to use this in an actual analysis, deduplicating page_views might be considered where subsequent pageviews happen on the same page_location. (In some situations it may be relevant to keep the same-page reloads though.)

Step 5: Constructing Complete Conversion Path Strings

path_strings AS (
  SELECT
    user_pseudo_id,
    session_id,
    STRING_AGG(CONCAT(CAST(step_number AS STRING), '. ', page_location), ' > ' ORDER BY step_number) AS full_path,
    COUNT(*) AS page_count
  FROM
    ordered_pageviews
  GROUP BY
    user_pseudo_id,
    session_id
)

This CTE transforms our sequenced pageviews into readable journey strings. The STRING_AGG function concatenates all pages within each session, creating a visual representation of the complete path from landing page to purchase.

💡
The nice thing about STRING_AGG is that we do not need to know in advance how many values we're going to concatenate. As the example shows, it also allows including different separators, thereby enabling the customization for the actual output.

Step 6: FINAL

SELECT
  full_path,
  COUNT(*) AS path_count,
  MAX(page_count) AS pages_in_path
FROM
  path_strings
  
GROUP BY
  full_path
  having pages_in_path > 1
ORDER BY
  path_count DESC 

The final SELECT aggregates our conversion paths to identify patterns. We're counting how many times we had each distinct path to purchase, and showing the number of pages in each journey.

Article content
Output of the user path conversion script (picture from original article)

Conclusion

We hope you find this approach valuable for your own analytics projects. Feel free to share your own approach or ideas for expanding this analysis in the comments!

Now it's your turn!

I hope you've enjoyed this tutorial and feel a bit more confident to utilise your own Google Analytics data in BigQuery. Drop a line in the comments to share your own approach or ideas for expanding this analysis, or if you have any questions, feedback or suggestions related to this article.

Reclaim a significant amount of your valuable time by subscribing to our premium membership. You'll gain immediate access to 23 similar tutorials with SQL code and 7 ready-to-go GA4 report queries.

Subscribe to Premium