Automating BigQuery workflows with conditional logic and data readiness checks
BigQuery offers powerful scripting capabilities that can simplify complex workflows. In this post, we’ll explore two essential techniques: using IF…THEN…END IF for procedural control and leveraging the metadata in project.dataset.__TABLES__
to ensure your Google Analytics 4 (GA4) data is ready before running queries. These strategies help you avoid unnecessary processing, reduce costs, and improve efficiency in your data pipelines.
1. Leveraging IF…THEN…END IF for procedural logic
While many SQL developers rely on the CASE WHEN
expression for conditional logic within a SELECT
statement, BigQuery’s scripting environment allows for more dynamic control with IF…THEN…END IF. This control-flow statement is especially useful in stored procedures and scripts where you need to:
- Define and use variables
- Dynamically set values
- Control the execution flow based on complex conditions
Example: Running code on weekdays
The following example demonstrates how to execute a block of code only on weekdays:
DECLARE current_day STRING;
BEGIN
-- Get today's day of the week
SET current_day = FORMAT_TIMESTAMP('%A', CURRENT_TIMESTAMP());
-- Only run the query if it's a weekday
IF current_day NOT IN ('Saturday', 'Sunday') THEN
SELECT 'Running weekday report...' AS status;
ELSE
SELECT 'Skipping execution. It is the weekend.' AS status;
END IF;
END;
Code language: PHP (php)
Benefits of Using IF...THEN...END IF:
- Procedural Logic: Unlike
CASE WHEN
, this syntax allows for multi-step procedural operations. - Dynamic Execution: It lets you perform different actions based on variable conditions.
- Improved Control: Ideal for scenarios such as dynamic table updates, looping constructs, or exception handling.
2. Ensuring GA4 Data readiness with modified time checks
When working with GA4 exports in BigQuery, timing is critical. GA4 tables (like events_YYYYMMDD
) may not arrive at a predictable time each day, and rerunning the same transformation repeatedly is not only inefficient but can also lead to duplicate processing.
The challenge
- Uncertain Arrival Times: GA4 data can be delayed, so a table might not be complete when your query runs.
- Avoiding Redundancy: You want to process data only once per day, preventing unnecessary cost and load.
The Solution: Querying table metadata
BigQuery provides a system view (project.dataset.__TABLES__
) that contains metadata for each table, including the last_modified_time
. By querying this view, you can determine whether a table has been updated recently and decide if your processing should run.
Step-by-Step implementation
Step 1: Check if the Destination Table Was Updated Today
Before running your main query, check if the destination table has already been refreshed:
DECLARE alreadyRanToday BOOL;
SET alreadyRanToday = (
SELECT COUNT(*) > 0
FROM `project.dataset.__TABLES__`
WHERE table_id = 'table_you_are_creating'
AND DATE(TIMESTAMP_MILLIS(last_modified_time)) = DATE(CURRENT_DATE())
);
- Purpose: This condition prevents the query from running multiple times in one day.
Step 2: Verify the GA4 table’s readiness
Next, ensure that yesterday’s GA4 table exists and was modified recently. This step protects against processing incomplete or missing data.
DECLARE ga4TableExist BOOL;
SET ga4TableExist = (
SELECT COUNT(*) > 0
FROM `project.dataset.__TABLES__`
WHERE table_id = CONCAT('events_', FORMAT_DATE('%Y%m%d', DATE_SUB(DATE(CURRENT_DATE()), INTERVAL 1 DAY)))
AND TIMESTAMP_DIFF(
TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 60 MINUTE),
TIMESTAMP_ADD(TIMESTAMP_MILLIS(last_modified_time), INTERVAL 60 MINUTE),
MINUTE
) > 10
);
- Explanation: By ensuring that the table was modified within the last 10 minutes, you safeguard against querying an incomplete dataset.
Step 3: Execute the main query conditionally
Finally, combine these checks to run your query only when both conditions are met:
IF NOT alreadyRanToday THEN
IF ga4TableExist THEN
SELECT *
FROM `project.dataset.events_*`
WHERE _TABLE_SUFFIX = DATE_SUB(DATE(CURRENT_DATE()), INTERVAL 1 DAY);
END IF;
END IF;
Key Advantages:
- Prevents Unnecessary Processing: The script only executes when needed.
- Ensures Data Accuracy: Queries run only when the GA4 data is confirmed to be complete.
- Cost Efficiency: Scheduled checks (e.g., every 60 minutes) incur no cost if the conditions aren’t met.
Understanding project.dataset.__TABLES__
for metadata queries
The project.dataset.__TABLES__
view is an invaluable resource for managing your BigQuery workflows. Here’s why:
- Table Metadata Access: This view stores important information such as
table_id
andlast_modified_time
. - Dynamic Decision-Making: By querying this view, you can build logic into your scripts that adapts to the state of your data.
- Automation Friendly: It allows you to programmatically check the status of your tables and conditionally execute further processing, a critical aspect of automated data pipelines.
Use cases:
- Verifying Data Freshness: Ensure that your queries run only after the data has been fully updated.
- Avoiding Redundancy: Prevent reprocessing of data that has already been handled within the same day.
- Optimizing Resource Usage: Reduce unnecessary query executions, saving both time and money.
For other types of information_schema that are available in BigQuery, you can have a look at this google documentation.
Conclusion
Combining the procedural flexibility of IF…THEN…END IF with the strategic use of metadata from project.dataset.__TABLES__
provides a robust framework for automating BigQuery workflows. This approach not only streamlines your operations but also ensures data accuracy and cost efficiency.
By integrating these advanced techniques, you can optimise your data pipelines and stay ahead in your data processing tasks. Let’s discuss!