Let's start with some bad news: your Google Analytics reports are not necessarily a representation of reality. Let's assume you have the perfect implementation and can measure all behaviour on your website in detail (which is nearly impossible). Your GA reports would still suck big time. Some of the reasons:
- users using and/or sharing multiple devices
- usage of ad blockers
- browsers implementing tracking protection mechanisms
- arbitrary session timeout settings
- sampling in reports
With this in mind, it is safe to conclude that Google Analytics is best used as a tool to spot trends. It can handle large amounts of behavioural website data and it enables you - when interpreted in the right way - to catch the signal between all the noise.
BigQuery results vs Google Analytics reports
Ok, now you're querying the GA export data in BigQuery and when you compare the results with the GA user interface, the numbers just don't add up. How is this possible?
First of all: don't worry about it too much. As we've concluded earlier, Google Analytics data is not exactly metaphysics. However, there are some things to keep in mind while querying GA data in BigQuery.
The BigQuery export schema contains raw data that is collected and processed by Google Analytics. When you are used to GA reports, you'll soon find out that a lot of metrics don't exist here. You'll have to calculate them based on the definitions that Google provides. Sometimes these are crystal clear, but in a lot of cases the documentation is inconsistent or just absent. The upside is that you have the freedom to create your own definitions. Why use the default metrics when you can customise everything?
Dealing with data on different levels can be hard. In the Universal Analytics export you'll have to wrangle user, session, hit and product scope data. In GA4 there is user, event and item data to worry about. You'll need to have a basic understanding of how these scopes relate to each order, to be able to generate meaningful results. Remember: BigQuery won't tell you if your results are incorrect, as long as your query is valid.
'Show me your queries and I'll tell you who you are'. Even when all definitions are clear and you completely understand the ins and outs of scopes in Google Analytics, the results can differ depending on the author of the query. A rule of thumb: keep it simple, stupid. In some cases, however, this is easier said than done.
Examples of GA4 specifics
- Google Analytics 4 reports use a definition of active users (users who are currently engaged)
- When replicating the session source/medium report in the GA4 user interface, it is nearly impossible to get the numbers to match. There is no source / medium information in the
session_startevent. Also the session attribution used in the GA4 user interface is still a black box.
Examples of UA specifics
- Make sure to use
totals.visits = 1in your
whereclause to exclude sessions without interactions as by default and opposing to the Google Analytics user interface, the BigQuery export tables contains all sessions, including the ones without interactions.
- If you use
count(distinct concat(fullvisitorid, cast(visitstarttime as string)))to count sessions, you don't have to worry about midnight-split-sessions.
- UA estimates users using an algorithm called HyperLogLog++, BigQuery (and GA unsampled reports) will give you the exact number.
I'll add more examples when I come across. If you have suggestions, let me know.
Enjoying this post? Subscribe to the GA4BigQuery newsletter!
You will receive regular updates (max 1x/week) straight in your inbox. You can unsubscribe anytime.