Data pipeline optimisation with Google Cloud and Dataform
In a recent engagement, we worked with a client to transform their data pipelines, tackling performance and cost inefficiencies in their Google Cloud BigQuery environment. The work culminated in a comprehensive optimisation strategy built on Dataform, improved SQL practices, and tailored solutions, delivering significant performance gains and cost savings. Here’s a deep dive into the highlights of our project.
Identifying inefficiencies in BigQuery workflows
We began by analysing the existing data architecture, identifying key areas of inefficiency:
- Long view chains: Multiple layers of views increased complexity and degraded performance.
- Lack of incremental logic: Transformations processed entire datasets unnecessarily, increasing costs.
- Unoptimised tables: Absence of partitioning and clustering led to full table scans.
Transitioning to Dataform
To address these challenges, we transitioned from BigQuery Scheduled Queries to Dataform, unlocking the following benefits:
- Improved maintainability: Modular SQL scripts made workflows easier to debug and maintain.
- Version control: Seamless integration with Git allowed better collaboration and change tracking.
- Automated dependency management: Dependencies were automatically managed, minimising manual intervention and errors.
- Incremental processing capabilities: Processing focused only on new or updated data, enhancing efficiency (a sketch of such a model follows this list).
- Assertions for data quality: Automated checks ensured consistency and reliability throughout the pipeline.
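To make the pattern concrete, here is a minimal sketch of an incremental Dataform model in a hypothetical file such as definitions/orders_daily.sqlx. The table and column names (raw_orders, order_id, customer_id, event_timestamp, amount) are placeholders rather than the client's schema. The config block marks the table as incremental, partitions and clusters the output, and declares data-quality assertions, while ref() lets Dataform track the dependency on the source table automatically.

```sqlx
config {
  type: "incremental",                       // process only new data on each run
  bigquery: {
    partitionBy: "DATE(event_timestamp)",    // partition the output table by day
    clusterBy: ["customer_id"]               // cluster for cheaper filtered reads
  },
  assertions: {
    nonNull: ["order_id", "event_timestamp"],
    uniqueKey: ["order_id"]                  // automated data-quality checks
  }
}

SELECT
  order_id,
  customer_id,
  event_timestamp,
  amount
FROM ${ref("raw_orders")}   -- ref() registers the dependency in the pipeline DAG

-- On incremental runs, read only rows newer than what the table already holds
${when(incremental(),
  `WHERE event_timestamp > (SELECT MAX(event_timestamp) FROM ${self()})`)}
```

On a full refresh Dataform rebuilds the whole table; on scheduled incremental runs, the when(incremental(), ...) filter limits the read to rows newer than the current maximum timestamp, which is where most of the cost reduction comes from.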
Key optimisation strategies
- Query structure improvements: Refactored nested subqueries into Common Table Expressions (CTEs), improving readability and debugging efficiency (a sketch follows this list).
- Consolidation of intermediate tables: Combined redundant views into fewer tables, reducing complexity and repeated computation.
- Implementation of incremental tables: Processed only new or updated data, significantly reducing data scanned and associated costs.
- Partitioning of tables: Partitioned tables based on relevant date fields, improving query performance.
- Eliminating full table scans: Restructured incremental tables so refreshes no longer scan entire source tables, ensuring more targeted data processing (see the partitioning sketch after this list).
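As a hedged illustration of the CTE refactor (the table, columns, and business rule below are placeholders, not the client's logic), each intermediate step gets a name that can be queried and debugged on its own:

```sql
-- Hypothetical example: logic previously written as nested subqueries,
-- rewritten as named CTEs so each step can be inspected in isolation.
WITH daily_orders AS (
  SELECT
    customer_id,
    DATE(event_timestamp) AS order_date,
    SUM(amount) AS daily_total
  FROM `my-project.sales.orders`           -- placeholder source table
  GROUP BY customer_id, order_date
),

active_customers AS (
  SELECT customer_id
  FROM daily_orders
  GROUP BY customer_id
  HAVING COUNT(DISTINCT order_date) >= 5   -- placeholder business rule
)

SELECT
  d.customer_id,
  d.order_date,
  d.daily_total
FROM daily_orders AS d
JOIN active_customers AS a
  ON d.customer_id = a.customer_id;
```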
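And a sketch of the partitioning strategy, again with placeholder names: the output table is partitioned by date and clustered on a common filter column, so downstream queries that filter on the partition column read only the relevant partitions instead of the whole table.

```sql
-- Hypothetical DDL: materialise the reporting table partitioned by day
-- and clustered on the most common filter column.
CREATE TABLE IF NOT EXISTS `my-project.reporting.orders_daily`
PARTITION BY order_date
CLUSTER BY customer_id
AS
SELECT
  customer_id,
  DATE(event_timestamp) AS order_date,
  SUM(amount) AS daily_total
FROM `my-project.sales.orders`
GROUP BY customer_id, order_date;

-- Because the filter is on the partition column, BigQuery prunes every
-- partition older than seven days instead of scanning the full table.
SELECT
  customer_id,
  SUM(daily_total) AS total_last_7_days
FROM `my-project.reporting.orders_daily`
WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY customer_id;
```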
Cost savings analysis
Our optimisation efforts translated into substantial cost savings:
- Data processed for key tables: BigQuery costs were reduced by over 40% annually, driven by a 92% reduction in data processed for some large output tables.
- Dashboard optimisation: Dashboard redesigns also contributed to query optimisation. Separating lightweight and heavy charts onto different pages and consolidating scorecards eliminated redundant queries and unnecessary filters, cutting the data read from the final output tables by hundreds of GBs per dashboard load.
- Operational efficiencies: Reduced manual dependency management, debugging, and new model setup saved approximately 32 engineer hours per month, cutting operational costs by close to 20%.
Conclusion
This project demonstrates how targeted optimisations can transform data pipelines, improving performance while dramatically reducing costs. Leveraging tools like Dataform and best practices in SQL and BigQuery, we delivered a smarter, more efficient solution tailored to the client’s needs.
Resources
- Read our blog post on Dataform for BigQuery: A basic end-to-end guide.
- Watch our Behind the Cloud series on YouTube.