Dataform for BigQuery: A basic end-to-end guide

Dataform is a powerful tool for managing your data workflows in a structured, version-controlled, and automated way. Whether you’re a beginner or an experienced data engineer, Dataform simplifies SQL-based transformations while integrating seamlessly with Google BigQuery.

Although this blog offers a basic introduction to Dataform’s functionality, users can achieve significantly more with Dataform. From advanced scheduling, parameterised queries, and dependency management to complex data modelling, the possibilities are vast.

We will be releasing more blogs that dive deeper into various aspects of Dataform, covering best practices, optimisation techniques, and real-world use cases. This post will also serve as a live blog: we will keep adding new features and updates, and link to detailed Dataform-related articles as they become available.

Make sure to bookmark this blog and stay tuned for upcoming posts that will help you leverage Dataform to its full potential!

Now, we’ll take you through an end-to-end setup of Dataform, starting from creating a workspace to building target tables, while exploring all its key features along the way.

Setting up your Dataform workspace

Before we dive into defining sources and building tables, let’s set up our Dataform workspace.

1: Create a Google Cloud project

  1. Go to Google Cloud Console.
  2. Create a new project or select an existing one.
  3. Enable the BigQuery API.

2: Enable Dataform in Google Cloud

  1. Navigate to Google Cloud Dataform.
  2. Click Create Repository and give it a name.
  3. Choose a location (typically the same as your BigQuery dataset).
  4. Initialise the repository with a new workspace.

3: Configure Dataform

After setting up your Dataform workspace, it’s important to configure the workflow_settings.yaml file. This file defines the default project, location, datasets, and Dataform core version for your workspace.

Example workflow_settings.yaml configuration:

defaultProject: dataform-64568
defaultLocation: EU
defaultDataset: dataform
defaultAssertionDataset: dataform_assertions
dataformCoreVersion: 3.0.0

This configuration lets you execute all transformations and assertions within the specified project and datasets, streamlining your workflow management.
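
For example, the defaultAssertionDataset is where Dataform writes the views it generates for data-quality assertions. Below is a minimal sketch of the built-in assertion options in a table's config block, using columns from the cycle hire data we define later:

config {
  type: "table",
  assertions: {
    uniqueKey: ["rental_id"],            // each rental should appear only once
    nonNull: ["bike_id", "start_date"]   // these columns must never be null
  }
}

Any rows that violate these checks surface through assertion views created in the dataform_assertions dataset.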

Defining your data sources

Now that your workspace is set up, let’s define the source dataset and tables. We’ll use a free BigQuery dataset for demonstration.

1: Choose a public dataset

Google provides a variety of public datasets. Let’s use bigquery-public-data.london_bicycles.cycle_hire.
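
If you would like a quick look at the data first, you can preview it in the BigQuery console or run a small query against the public table:

SELECT *
FROM `bigquery-public-data.london_bicycles.cycle_hire`
LIMIT 10;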

2: Define the source in Dataform

  • Delete all other folders and files except the definitions/ folder and workflow_settings.yaml file.
  • Create a new directory called source/ under definitions/.
  • Define the source tables under the definitions/source/ directory.

1. Bike Hire Source Data – cycle_hire.sqlx

config {
  type: "declaration",
  database: "bigquery-public-data",
  schema: "london_bicycles",
  name: "cycle_hire",
  description: "Cycle Hire Data"
}

2. Bike Stations Source Data – cycle_stations.sqlx

config {
  type: "declaration",
  database: "bigquery-public-data",
  schema: "london_bicycles",
  name: "cycle_stations",
  description: "Cycle Stations Data"
}

Creating transformations

Let’s create transformations that will process and refine our London bicycle hire dataset. We will build intermediate tables, apply multiple transformations, and finalise the results into target tables.

Step 1: Create intermediate tables

Intermediate tables help us clean and enrich the raw data before final aggregation. These should be placed under the definitions/intermediate/ directory.

1. Normalise Bike Ride Durations

config {
  type: "table",
  schema: "intermediate",
  name: "normalized_cycle_rides"
}

-- If schema is not set in the config, the table is created in the default dataset from workflow_settings.yaml.
-- If name is not set, the table takes the name of the file. It is good practice to keep table names the same as your file names.
SELECT
  rental_id,
  bike_id,
  start_station_id,
  end_station_id,
  TIMESTAMP_DIFF(end_date, start_date, SECOND) AS ride_duration_seconds,
  start_date,
  end_date
FROM ${ref("cycle_hire")}

2. Aggregate Daily Bike Usage

config {
  type: "table",
  schema: "intermediate",
  name: "daily_cycle_usage"
}

-- start_station_id and end_station_id are carried through so the output tables can join station details.
SELECT
  DATE(start_date) AS ride_date,
  bike_id,
  start_station_id,
  end_station_id,
  COUNT(*) AS total_rides
FROM ${ref("normalized_cycle_rides")}
GROUP BY 1, 2, 3, 4

Step 2: Define final output tables

Final tables consolidate the transformed data for reporting and analysis. These should be placed under the definitions/output/ directory.

1. Summary Report

config {
  type: "table",
  schema: "output",
  name: "summary_report"
}

SELECT
  d.ride_date,
  d.bike_id,
  d.total_rides,
  s.name AS start_station,
  e.name AS end_station
FROM ${ref("daily_cycle_usage")} d
-- The public cycle_stations table identifies stations by its id and name columns
LEFT JOIN ${ref("cycle_stations")} s ON d.start_station_id = s.id
LEFT JOIN ${ref("cycle_stations")} e ON d.end_station_id = e.id

2. Peak Ride Hours

config {
  type: "view",
  schema: "output",
  name: "peak_ride_hours"
}

SELECT
  EXTRACT(HOUR FROM start_date) AS hour_of_day,
  COUNT(*) AS ride_count
FROM ${ref("normalized_cycle_rides")}
GROUP BY 1
ORDER BY ride_count DESC

Instead of storing the data physically, we can create views for flexible and cost-efficient querying.


Incremental tables are another type of table that can be defined in Dataform. To learn more about these tables and their uses, please refer to this blog.
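
As a quick taste, here is a minimal sketch (a hypothetical file, not part of this walkthrough) of an incremental version of the rides data: the first run builds the full table, and subsequent runs append only rows newer than what has already been loaded.

config {
  type: "incremental",
  schema: "intermediate",
  name: "incremental_cycle_rides"
}

SELECT
  rental_id,
  bike_id,
  start_date,
  end_date
FROM ${ref("cycle_hire")}
-- On incremental runs, only process rows newer than the latest start_date already in the table
${when(incremental(), `WHERE start_date > (SELECT MAX(start_date) FROM ${self()})`)}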


The “Not found: Dataset <project_id>: intermediate was not found in location” error message appears in the Dataform UI only when the underlying tables have not yet been created. The error will disappear once the pipeline has been executed and the tables have been created.

Partitioning and clustering in Dataform

Both of these help optimise query performance and reduce costs in BigQuery by improving how data is stored and accessed.

Partitioning

Partitioning divides a table into segments based on a specific column (e.g., ride_date). Queries that filter by this column will scan only the relevant partitions, improving efficiency.

Clustering

Clustering organises data within partitions based on certain columns (e.g., bike_id and start_station). This speeds up queries that filter or group by these columns.

Implementing partitioning and clustering in our summary_report table gives us efficient storage and faster query execution. This is particularly beneficial for queries that filter on a specific date, as it reduces the amount of data scanned in BigQuery and so lowers cost. Let’s update the summary_report table as shown below.

config {
  type: "table",
  schema: "output",
  name: "summary_report",
  bigquery: {
    partitionBy: "ride_date",
    clusterBy: ["bike_id", "start_station"]
  }
}

SELECT
  d.ride_date,
  d.bike_id,
  d.total_rides,
  s.name AS start_station,
  e.name AS end_station
FROM ${ref("daily_cycle_usage")} d
LEFT JOIN ${ref("cycle_stations")} s ON d.start_station_id = s.id
LEFT JOIN ${ref("cycle_stations")} e ON d.end_station_id = e.id

Compile and run transformations

Now that we have defined our transformations and output tables, we need to execute them in Dataform to create the required tables and views in BigQuery.

Executing multiple actions

  • Click Start Execution to validate your SQL code and check for errors.
  • Navigate to the Actions section in Dataform.
  • Click Multiple Actions… to selectively run the required transformations.
  • Choose the following actions to be executed:
    • summary_report
    • peak_ride_hours
  • Select the Include Dependencies option.
  • Click Run Selected Actions to execute and create these tables and views in your BigQuery dataset.

By selecting multiple actions, we ensure that both outputs are created efficiently while Dataform respects the dependencies between intermediate and output tables.
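
If you find yourself selecting the same set of actions repeatedly, another option (not covered in the steps above) is to tag them in their config blocks and execute by tag instead. A minimal sketch, assuming a shared "reporting" tag:

config {
  type: "table",
  schema: "output",
  name: "summary_report",
  tags: ["reporting"]   // all actions tagged "reporting" can be executed together
}

Dependencies are still respected, so including dependencies in the execution will build the intermediate tables first.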

Customising Dataform structure

This blog uses a simple sample structure to illustrate how Dataform works. Once you have grasped the fundamentals, you can adapt the structure to your project’s specific needs. Depending on the complexity of your data workflows, teams may organise their Dataform repository differently, optimise queries for efficiency, or integrate additional features such as parameterised datasets and advanced scheduling.
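
As a small taste of the parameterisation mentioned above, Dataform lets you declare variables in workflow_settings.yaml and read them in your SQLX files. A minimal sketch, assuming a hypothetical env variable:

In workflow_settings.yaml:

vars:
  env: dev

In a SQLX file:

config {
  type: "view",
  schema: "output",
  name: "rides_sample"
}

SELECT *
FROM ${ref("normalized_cycle_rides")}
-- Keep non-production runs cheap by limiting the rows returned
${when(dataform.projectConfig.vars.env !== "prod", "LIMIT 1000")}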

Written by

Prasanna is a Senior Analytics Engineer at Measurelab. With a passion for building scalable and efficient data solutions, he is dedicated to making data more accessible and actionable for businesses. Recently, he has been exploring Machine Learning and AI to enhance data engineering processes and drive deeper insights. As a certified Google Cloud and AWS Data Engineer, Prasanna brings expertise in cloud-based data architectures, automation, and advanced analytics to help organizations unlock the full potential of their data.
