How to flatten tables with user-defined functions in BigQuery
Google is continuously improving and expanding the Google Cloud Platform, and BigQuery has truly benefited from this. It has grown from a simple, scalable data warehouse into a comprehensive and dynamic system. As you may guess, I’m a massive fan of BigQuery and its user-defined functions.
BigQuery offers a huge amount of functionality, including Data Definition Language (DDL) and Data Manipulation Language (DML). However, sometimes there simply isn’t a function for what you are trying to do. In the past, I would often end up accessing BigQuery via Python or SAS and applying the missing function there.
User-Defined Functions (UDFs) cover some of the current gaps, and since they are now persistent (I’ll explain this further later), you can share them with your colleagues and re-use them.
What is a UDF?
A UDF is simply a function you create to apply logic to your data that is not available in the standard SQL functions. Generally, you will write it in SQL or JavaScript; simple logic or even base Python is possible too, albeit with a rather hacky approach.
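As a minimal sketch of the two most common flavours, the script below defines one temporary SQL UDF and one temporary JavaScript UDF, then calls both. The function names add_one and shout are made up for illustration.

```sql
# Hypothetical function names, for illustration only.
# A SQL UDF: the body is a single SQL expression wrapped in parentheses.
CREATE TEMP FUNCTION add_one(x INT64) AS (x + 1);

# A JavaScript UDF: the body is a JavaScript string and the return type is declared explicitly.
CREATE TEMP FUNCTION shout(s STRING)
RETURNS STRING
LANGUAGE js AS """
  return s === null ? null : s.toUpperCase() + '!';
""";

# Both functions are available for the rest of this script.
SELECT add_one(41) AS answer, shout('hello') AS greeting;
```

Temporary functions like these disappear when the script finishes, which makes them handy for experimenting before you decide to save anything.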
How do you generate UDFs?
BigQuery’s comprehensive guides show that the general structure is very similar to some DDL statements, although it can take a little getting used to. I go into this in more detail, with examples, over on Medium.
How to flatten tables with UDFs
Imagine a scenario where you have a Google Analytics 4 (GA4) or Firebase (FB) dataset in which you need to flatten multiple fields. Normally you would need to write out the standard UNNEST subquery for each field:
(SELECT VALUE.INT_VALUE FROM UNNEST(EVENT_PARAMS) WHERE KEY = 'CCC')
However, that is super painful to repeat multiple times and likely to produce errors. So, using a UDF, scripting and the fab EXECUTE IMMEDIATE statement, I have written the code below, which will automatically UNNEST your GA4 data.
Be careful to select the specific dates and events you wish to review. You can even remove the EXECUTE IMMEDIATE step and simply copy the generated query into another window or a tool such as Python.
DECLARE SQLRUN STRING DEFAULT '';
# Return the value struct of the event parameter whose key matches key1
CREATE TEMP FUNCTION ga4_firebase(
  key1 STRING,
  params ARRAY<STRUCT<
    key STRING,
    value STRUCT<
      string_value STRING,
      int_value INT64,
      float_value FLOAT64,
      double_value FLOAT64>>>)
AS ((
  SELECT param.value
  FROM UNNEST(params) AS param
  WHERE param.key = key1
));
# *****************************************
# Build the query string: one output column per event parameter key
# *****************************************
SET SQLRUN = (
  SELECT
    CONCAT(
      'SELECT EVENT_NAME , ',
      STRING_AGG(
        CONCAT("ga4_firebase('", key, "', event_params).", event_parameter_value, " AS ", key, ','),
        '\n'),
      ' FROM `project.analytics_yyyymmdd.events_*` WHERE _TABLE_SUFFIX = "20210801" ORDER BY 1')
  FROM (
    SELECT DISTINCT
      event_name,
      params.key AS key,
      # Name of the value field that is populated for this key
      (CASE
        WHEN params.value.string_value IS NOT NULL THEN 'string_value'
        WHEN params.value.int_value IS NOT NULL THEN 'int_value'
        WHEN params.value.double_value IS NOT NULL THEN 'double_value'
        WHEN params.value.float_value IS NOT NULL THEN 'float_value'
      END) AS event_parameter_value
    FROM
      `project.analytics_yyyymmdd.events_*`,
      UNNEST(event_params) AS params
    WHERE
      _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY))
        AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
      # The event filter belongs inside the subquery, before its closing parenthesis
      AND event_name = 'page_view'
  )
);
# *************************************
# Execute the query
# *************************************
EXECUTE IMMEDIATE SQLRUN;
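To see what the generated query looks like without touching any GA4 table, the sketch below builds the same string from a hard-coded list of keys and selects it instead of executing it. The two keys (page_title and ga_session_id) and their value fields are stand-ins for whatever the DISTINCT subquery would find in your own export.

```sql
# Self-contained sketch: the (key, value field) pairs are hard-coded assumptions
# standing in for the keys discovered in a real export.
DECLARE SQLRUN STRING DEFAULT '';
SET SQLRUN = (
  SELECT
    CONCAT(
      'SELECT EVENT_NAME , ',
      STRING_AGG(
        CONCAT("ga4_firebase('", key, "', event_params).", event_parameter_value, " AS ", key, ','),
        '\n'),
      ' FROM `project.analytics_yyyymmdd.events_*` WHERE _TABLE_SUFFIX = "20210801" ORDER BY 1')
  FROM UNNEST([
    STRUCT('page_title' AS key, 'string_value' AS event_parameter_value),
    STRUCT('ga_session_id' AS key, 'int_value' AS event_parameter_value)
  ])
);
# Inspect the generated text instead of running it.
SELECT SQLRUN AS generated_query;
```

The result is a single string with one ga4_firebase(...) column per key, each on its own line and each ending in a comma. The order of those lines is not guaranteed, and the comma left before FROM is harmless because BigQuery accepts a trailing comma in a SELECT list.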
Saving and persisting UDFs
I mentioned earlier that Google recently introduced persistent UDFs, which can be saved in a dataset, and this is relatively simple to do. You just need a destination dataset and a name for your UDF. The function is then saved in your dataset with a function flag that makes it easy to spot. Clicking on it shows its details, and you can call it from any query as dataset.function_name.
CREATE FUNCTION UDFS.ga4_firebase
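Spelled out in full, a persistent version of the flattening function might look like the sketch below. It assumes a dataset called UDFS already exists in your default project, uses CREATE OR REPLACE so the script can be re-run, and picks page_title purely as an example key.

```sql
# Assumes a dataset named UDFS already exists in your default project.
CREATE OR REPLACE FUNCTION UDFS.ga4_firebase(
  key1 STRING,
  params ARRAY<STRUCT<
    key STRING,
    value STRUCT<
      string_value STRING,
      int_value INT64,
      float_value FLOAT64,
      double_value FLOAT64>>>)
AS ((
  SELECT param.value
  FROM UNNEST(params) AS param
  WHERE param.key = key1
));

# Call it from any later query by its dataset-qualified name.
# 'page_title' is only an example key.
SELECT
  event_name,
  UDFS.ga4_firebase('page_title', event_params).string_value AS page_title
FROM `project.analytics_yyyymmdd.events_*`
WHERE _TABLE_SUFFIX = '20210801';
```

The only change from the temporary version is dropping TEMP and qualifying the name with the dataset, so anyone with access to that dataset can call the function.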
In fact, a great UDF community has formed around an open-source UDF project that you can apply to your own work. It includes many statistical techniques, such as p-values, and much more, all available through the following syntax:
SELECT bqutil.fn.udf_name(variable)
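As one concrete call, and assuming the community project still publishes a median function in the public bqutil.fn dataset (worth checking against the project's own function list, since the name and argument shape are assumptions here), a query could look like this:

```sql
# Assumption: bqutil.fn.median exists and takes an array of numbers.
SELECT bqutil.fn.median([1, 2, 3, 4, 5]) AS median_value;
```

Because the function lives in a shared public dataset, there is nothing to create or install first; you reference it by its full name like any other persistent UDF.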
Be sure to check them out!