Tutorial: How to combine public global weather and Google Analytics data in BigQuery (GA4 / UA)
10 min read

Tutorial: How to combine public global weather and Google Analytics data in BigQuery (GA4 / UA)

In this tutorial I will show you - step by step - how to combine weather data from the public BigQuery 'noaa_gsod' set with your own Google Analytics data.
Tutorial: How to combine public global weather and Google Analytics data in BigQuery (GA4 / UA)

Google Analytics is great for measuring web and app behaviour, but it can't tell you anything about contextual events that happen in the real world. Combining your website and/or app data with public available data can be of great value to your business.

A good example of contextual data that can be relevant to you or your business is weather data. BigQuery's public data program provides exactly what we need.

Exploration

Let's take a look at the public noaa_gsod global weather data set.

This public dataset was created by the National Oceanic and Atmospheric Administration (NOAA) and includes global data obtained from the USAF Climatology Center. This dataset covers GSOD data between 1929 and present, collected from over 9000 stations. Update frequency: daily.

This is an incredible source of information. We will explore this data set so we can use it for our purpose: combining it with GA data.

Weather stations

Before we dive in the actual weather data, we need to choose the weather station(s) we want to be the source(s) of our weather data. As weather data can differ from continent and country to another, but also can be totally different between regions and cities, we have to spent some time looking at the schema bigquery-public-data.noaa_gsod.stations first.

Remember, the data set contains weather data from 1929 onwards, so probably a lot of weather stations will not be active any more. Using the following query we can define how many stations worldwide are still active during the last week, grouped by country.

select
    country,
    count(distinct usaf) as stations
from
    `bigquery-public-data.noaa_gsod.stations` as stations
where
    stations.end >= format_date('%Y%m%d',date_sub(current_date(), interval 7 day))
group by
    country
order by
    stations desc

As a Dutch citizen I'm looking for a station that is representative for the Netherlands. I know the the Royal Netherlands Meteorological Institute (KNMI) is based in De Bilt.

select
    *
from
    `bigquery-public-data.noaa_gsod.stations` as stations
where
    stations.end >= format_date('%Y%m%d',date_sub(current_date(), interval 7 day))
    and name like '%BILT%'

The station I need is known by the usaf identifier 062600. With this id we can query the weather data.

Note: if you want to combine data from multiple stations, that's also possible, but will complicate the joining process later on. You'll end up with multiple rows per day from the weather table that you have to join with one row per day of Google Analytics data. Averaging the results of the multiple stations before joining with the GA data is a possible solution, but for large countries with different climates this will not make a lot of sense.

Weather data schema

To get a better view on the data available, let's look at the schema of the most recent year table at the time of writing, bigquery-public-data.noaa_gsod.gsod2020:

Field name Type Description
stn STRING Cloud - GSOD NOAA
wban STRING WBAN number where applicable--this is the historical "Weather Bureau Air Force Navy" number - with WBAN being the acronym
date DATE Date of the weather observations
year STRING The year
mo STRING The month
da STRING The day
temp FLOAT Mean temperature for the day in degrees Fahrenheit to tenths. Missing = 9999.9
count_temp INTEGER Number of observations used in calculating mean temperature
dewp FLOAT Mean dew point for the day in degreesm Fahrenheit to tenths. Missing = 9999.9
count_dewp INTEGER Number of observations used in calculating mean dew point
slp FLOAT Mean sea level pressure for the day in millibars to tenths. Missing = 9999.9
count_slp INTEGER Number of observations used in calculating mean sea level pressure
stp FLOAT Mean station pressure for the day in millibars to tenths. Missing = 9999.9
count_stp INTEGER Number of observations used in calculating mean station pressure
visib FLOAT Mean visibility for the day in miles to tenths. Missing = 999.9
count_visib INTEGER Number of observations used in calculating mean visibility
wdsp STRING Mean wind speed for the day in knots to tenths. Missing = 999.9
count_wdsp STRING Number of observations used in calculating mean wind speed
mxpsd STRING Maximum sustained wind speed reported for the day in knots to tenths. Missing = 999.9
gust FLOAT Maximum wind gust reported for the day in knots to tenths. Missing = 999.9
max FLOAT Maximum temperature reported during the day in Fahrenheit to tenths--time of max temp report varies by country and region, so this will sometimes not be the max for the calendar day. Missing = 9999.9
flag_max STRING Blank indicates max temp was taken from the explicit max temp report and not from the 'hourly' data. * indicates max temp was derived from the hourly data (i.e., highest hourly or synoptic-reported temperature)
min FLOAT Minimum temperature reported during the day in Fahrenheit to tenths--time of min temp report varies by country and region, so this will sometimes not be the min for the calendar day. Missing = 9999.9
flag_min STRING Blank indicates min temp was taken from the explicit min temp report and not from the 'hourly' data. * indicates min temp was derived from the hourly data (i.e., lowest hourly or synoptic-reported temperature)
prcp FLOAT Total precipitation (rain and/or melted snow) reported during the day in inches and hundredths; will usually not end with the midnight observation--i.e., may include latter part of previous day. .00 indicates no measurable precipitation (includes a trace). Missing = 99.99 Note: Many stations do not report '0' on days with no precipitation--therefore, '99.99' will often appear on these days. Also, for example, a station may only report a 6-hour amount for the period during which rain fell. See Flag field for source of data
flag_prcp STRING A = 1 report of 6-hour precipitation amount B = Summation of 2 reports of 6-hour precipitation amount C = Summation of 3 reports of 6-hour precipitation amount D = Summation of 4 reports of 6-hour precipitation amount E = 1 report of 12-hour precipitation amount F = Summation of 2 reports of 12-hour precipitation amount G = 1 report of 24-hour precipitation amount H = Station reported '0' as the amount for the day (eg, from 6-hour reports), but also reported at least one occurrence of precipitation in hourly observations--this could indicate a trace occurred, but should be considered as incomplete data for the day. I = Station did not report any precip data for the day and did not report any occurrences of precipitation in its hourly observations--it's still possible that precip occurred but was not reported
sndp FLOAT Snow depth in inches to tenths--last report for the day if reported more thanonce. Missing = 999.9 Note: Most stations do not report '0' ondays with no snow on the ground--therefore, '999.9' will often appear on these days
fog STRING Indicators (1 = yes, 0 = no/not reported) for the occurrence during the day
rain_drizzle STRING Indicators (1 = yes, 0 = no/not reported) for the occurrence during the day
snow_ice_pellets STRING Indicators (1 = yes, 0 = no/not reported) for the occurrence during the day
hail STRING Indicators (1 = yes, 0 = no/not reported) for the occurrence during the day
thunder STRING Indicators (1 = yes, 0 = no/not reported) for the occurrence during the day
tornado_funnel_cloud STRING Indicators (1 = yes, 0 = no/not reported) for the occurrence during the day

Basic query

I'm mainly interested in basic concepts like temperature, wind, rain and snow, so I'll start with a basic query to pull some first data in.

But let's define a time frame first. In this case I'll take a range with a static start and dynamic end date.

with date_range as (
select
    date('2019-12-01') as start_date,
    date_sub(current_date(), interval 1 day) as end_date)

Next, add a source for our data. To be able to query multiple day tables at once we use a wildcard * in the table name. To make the wildcard work we need to add a _table_suffix and refer to the tables we want to query, using a cross join (in short:,) after our weather data table. In this example BigQuery will only use the schemas gsod2019 and onwards.

with date_range as (
select
    date('2019-12-01') as start_date,
    date_sub(current_date(), interval 1 day) as end_date)
    
select
    *
from
    `bigquery-public-data.noaa_gsod.gsod*`,
    date_range
where
    _table_suffix between cast(extract(year from start_date) as string) and cast(extract(year from end_date) as string)

Next, we only want to return data for the dates in our date_range (in this example only the last month from gsod2019 and everything up to the end_date) and only for the weather station(s) we selected.

with date_range as (
select
    date('2019-12-01') as start_date,
    date_sub(current_date(), interval 1 day) as end_date)
    
select
    *
from
    `bigquery-public-data.noaa_gsod.gsod*`,
    date_range
where
    _table_suffix between cast(extract(year from start_date) as string) and cast(extract(year from end_date) as string)
    and date(cast(year as int64),cast(mo as int64),cast(da as int64)) >= start_date
    and date(cast(year as int64),cast(mo as int64),cast(da as int64)) <= end_date
    and stn = '062600'

Well done! Now you got all results for the weather station(s) you selected for the desired time frame.

Optional: convert data to metric system

As you can see from the query results by now the noaa_gsod schema contains USA style customary units like degrees Fahrenheit, miles and inches. As most other countries use the metric system instead, I'll convert the columns I want to keep from our earlier select * to degrees Celcius and kilo/millimeters and rename them to add some context.

with date_range as (
select
    date('2019-12-01') as start_date,
    date_sub(current_date(), interval 1 day) as end_date)

select
    date(cast(year as int64),cast(mo as int64),cast(da as int64)) as date,
    case when temp = 9999.9 then null else round((temp - 32) * (5/9),1) end as temperature_celcius_mean,
    case when max = 9999.9 then null else round((max - 32) * (5/9),1) end as temperature_celcius_max,
    case when min = 9999.9 then null else round((min - 32) * (5/9),1) end as temperature_celcius_min,
    case when visib = 999.9 then null else round(visib * 1.609,1) end as visibility_km,
    case when wdsp = '999.9' then null else round(cast(wdsp as float64) * 1.852,1) end as wind_speed_kmh,
    case when prcp = 99.99 then null else round(prcp * 25.4,1) end as precipitation_mm,
    case when sndp = 999.9 then 0.0 else round(sndp * 25.4,1) end as snow_depth_mm,
    fog,
    rain_drizzle,
    hail,
    thunder
from
    `bigquery-public-data.noaa_gsod.gsod*`,
    date_range
where
    _table_suffix between cast(extract(year from start_date) as string) and cast(extract(year from end_date) as string)
    and date(cast(year as int64),cast(mo as int64),cast(da as int64)) >= start_date
    and date(cast(year as int64),cast(mo as int64),cast(da as int64)) <= end_date
    and stn = '062600'
order by
    date desc
Note: for some reason the column wdsp contains string values, therefore we have to convert it to float values.

Combine weather data with Google Analytics data

Now we have prepared our weather data, it is time to join it with Google Analytics data. This works for both GA4 and UA. I'll provide a basic combination for both GA versions below.

As you can see the data types of the both date fields are different (string in GA, date in weather), therefore we format weather.date to string to be able to join them.

Note: when your Google Analytics data is stored in a non-US region, you might have to copy the weather data to your own region using this feature (choose 'copy dataset' for a one-off copy, use data transfer for a scheduled copy) to be able to join data from the two data sets.

Example query: Google Analytics 4

with date_range as (
select
    date('2019-12-01') as start_date,
    date_sub(current_date(), interval 1 day) as end_date),
    
weather as (
select
    date(cast(year as int64),cast(mo as int64),cast(da as int64)) as date,
    case when temp = 9999.9 then null else round((temp - 32) * (5/9),1) end as temperature_celcius_mean,
    case when max = 9999.9 then null else round((max - 32) * (5/9),1) end as temperature_celcius_max,
    case when min = 9999.9 then null else round((min - 32) * (5/9),1) end as temperature_celcius_min,
    case when visib = 999.9 then null else round(visib * 1.609,1) end as visibility_km,
    case when wdsp = '999.9' then null else round(cast(wdsp as float64) * 1.852,1) end as wind_speed_kmh,
    case when prcp = 99.99 then null else round(prcp * 25.4,1) end as precipitation_mm,
    case when sndp = 999.9 then 0.0 else round(sndp * 25.4,1) end as snow_depth_mm,
    fog,
    rain_drizzle,
    hail,
    thunder
from
    `bigquery-public-data.noaa_gsod.gsod*`,
    date_range
where
    _table_suffix between cast(extract(year from start_date) as string) and cast(extract(year from end_date) as string)
    and date(cast(year as int64),cast(mo as int64),cast(da as int64)) >= start_date
    and date(cast(year as int64),cast(mo as int64),cast(da as int64)) <= end_date
    and stn = '062600'
order by
    date desc)
  
select
    events.event_date,
    count(distinct case when event_name = 'session_start' then concat(user_pseudo_id, cast(event_timestamp as string)) end) as sessions,
    count(distinct ecommerce.transaction_id) as transactions,
    ifnull(sum(ecommerce.purchase_revenue),0) as purchase_revenue,
    count(distinct ecommerce.transaction_id) / count(distinct case when event_name = 'session_start' then concat(user_pseudo_id, cast(event_timestamp as string)) end) as ecommerce_conversion_rate,
    max(weather.temperature_celcius_mean) as temperature_celcius_mean,
    max(weather.precipitation_mm) as precipitation_mm,
    max(weather.wind_speed_kmh) as wind_speed_kmh
from
    `ga4bigquery.analytics_250794857.events_*` as events,
    date_range
    left join weather on events.event_date = format_date('%Y%m%d',weather.date)
where
    _table_suffix between format_date('%Y%m%d',date_range.start_date) and format_date('%Y%m%d',date_range.end_date)
group by
    event_date
order by
    event_date desc

Example query: Universal Analytics

with date_range as (
select
    date('2016-08-01') as start_date,
    date('2017-03-01') as end_date),
    
weather as (
select
    date(cast(year as int64),cast(mo as int64),cast(da as int64)) as date,
    case when temp = 9999.9 then null else round((temp - 32) * (5/9),1) end as temperature_celcius_mean,
    case when max = 9999.9 then null else round((max - 32) * (5/9),1) end as temperature_celcius_max,
    case when min = 9999.9 then null else round((min - 32) * (5/9),1) end as temperature_celcius_min,
    case when visib = 999.9 then null else round(visib * 1.609,1) end as visibility_km,
    case when wdsp = '999.9' then null else round(cast(wdsp as float64) * 1.852,1) end as wind_speed_kmh,
    case when prcp = 99.99 then null else round(prcp * 25.4,1) end as precipitation_mm,
    case when sndp = 999.9 then 0.0 else round(sndp * 25.4,1) end as snow_depth_mm,
    fog,
    rain_drizzle,
    hail,
    thunder
from
    `bigquery-public-data.noaa_gsod.gsod*`,
    date_range
where
    _table_suffix between cast(extract(year from start_date) as string) and cast(extract(year from end_date) as string)
    and date(cast(year as int64),cast(mo as int64),cast(da as int64)) >= start_date
    and date(cast(year as int64),cast(mo as int64),cast(da as int64)) <= end_date
    and stn = '062600'
order by
    date desc)
  
select
    sessions.date,
    count(distinct concat(sessions.fullvisitorid, cast(sessions.visitstarttime as string))) as sessions,
    ifnull(sum(sessions.totals.transactions),0) as transactions,
    ifnull(sum(sessions.totals.totaltransactionrevenue),0)/1000000 as revenue,
    ifnull(sum(sessions.totals.transactions) / count(distinct concat(sessions.fullvisitorid, cast(sessions.visitstarttime as string))),0) as ecommerce_conversion_rate,
    max(weather.temperature_celcius_mean) as temperature_celcius_mean,
    max(weather.precipitation_mm) as precipitation_mm,
    max(weather.wind_speed_kmh) as wind_speed_kmh
from
    `bigquery-public-data.google_analytics_sample.ga_sessions_*` as sessions,
    date_range
    left join weather on sessions.date = format_date('%Y%m%d',weather.date)
where
    _table_suffix between format_date('%Y%m%d',date_range.start_date) and format_date('%Y%m%d',date_range.end_date)
group by
    date
order by
    date desc

Now it's your turn!

I hope you've enjoyed this tutorial and feel a bit more confident to utilise your own Google Analytics data in BigQuery. Please let me know if you have any feedback or suggestions to improve the quality of the content.


Enjoying this post? Subscribe to the GA4BigQuery newsletter!

You will receive regular updates (max 1x/week) straight in your inbox. You can unsubscribe anytime.