Using Python in Google Cloud Functions to automate Data Import into Google Analytics from BigQuery
A few months back I was granted access to a Google Cloud Source Repository (private Git repositories hosted on Google Cloud as part of Google’s developer tools) containing Python code and configuration details for automating the extraction of data from a BigQuery table and uploading it to Google Analytics as a custom dimension via the Management API, on a desired schedule.
Being primarily a JavaScript-focussed individual, I thought this was a good opportunity to expand my knowledge and work on something a bit different to my usual Google Tag Manager shenanigans. The only problem was coming up with a suitable scenario where I could make use of this repository. Thankfully, while discussing the potential applications of this code, I remembered a previous client request: to collect the word count of the client’s blog articles so that it was available in a custom dimension.
This request went round to a few Measurelabbers, who each came up with a different approach. My initial suggestion was to use GTM (naturally) to scrape the page in a custom JS variable, calculate a word count and pass the value on in a custom dimension field. However, this means running a bunch of additional JS code in the user’s browser and potentially delaying the page view tag so that it only fires once the visible content has loaded.
Another suggestion was to get the client to compile a CSV of each page and its word count and use the Data Import feature in Google Analytics. However, that approach requires repetitive manual input from the client every time a new blog post is added. Yet another suggestion was uploading the data to Google Cloud Storage or BigQuery and handling it that way. This Google Cloud solution offered (at least in part) a way to get round these obstacles.
With this problem in mind, I decided a good way to utilise this repo would be to set up a word count custom dimension for Measurelab’s own blog. Now that I had an objective, it was time to get started.
First of all, I needed to get hold of the blog word counts. Since I didn’t want to bloat the site with additional JS in GTM to scrape pages in real time, I figured I’d acquire the pages a different way: a Python script that crawls the site at a time when there isn’t much traffic and extracts the information from there.
I’m not going to go into the laborious, line-by-line detail of how the site-crawling code works, as I’m already spilling over into Page 2 of this document. But once the script is running, it identifies how many blog list pages there are, loops through those list pages for the blog URLs and then scrapes each one for its HTML content. Having already inspected a blog page, I’ve identified the elements that contain the text of the blog – that text gets extracted, cleaned (words made up entirely of punctuation, e.g. fragments of code snippets, are thrown out) and counted. During this process I can also scrape other pieces of information if they are available, such as the blog author and the date the blog was published, from any tags that hold them.
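To give a flavour of that step, here is a simplified sketch (not the actual code) of the crawl-and-count logic using requests and BeautifulSoup. The URL pattern and CSS selectors are placeholders; the real ones depend on the site’s markup.

```python
import string

import requests
from bs4 import BeautifulSoup

# Placeholder URL pattern for the paginated blog list pages
BLOG_LIST_URL = "https://www.measurelab.co.uk/blog/page/{}/"


def get_post_urls(page_number):
    """Return the blog post URLs found on one blog list page."""
    resp = requests.get(BLOG_LIST_URL.format(page_number), timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # 'article h2 a' is an illustrative selector; the real one depends on the theme
    return [a["href"] for a in soup.select("article h2 a") if a.get("href")]


def scrape_post(post_url):
    """Scrape one blog post and return its cleaned word count plus any extras."""
    resp = requests.get(post_url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    body = soup.select_one("div.entry-content")      # illustrative selector
    author = soup.select_one("a[rel='author']")      # illustrative selector
    words = body.get_text(separator=" ").split() if body else []
    # Throw out 'words' made up entirely of punctuation (e.g. bits of code snippets)
    words = [w for w in words if w.strip(string.punctuation)]
    return {
        "url": post_url,
        "word_count": len(words),
        "author": author.get_text(strip=True) if author else None,
    }
```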
Once the data has been scraped, it’s formatted into a dataframe, ready to be exported into a designated BigQuery table. Since this script is designed to be run periodically, I’ve included a code block that queries the existing table and compares it with the current dataframe; if the two are identical then there hasn’t been a new blog posted and there’s no need to update BigQuery. If the dataframe is different from what is currently in BigQuery, however, the table is overwritten with what was obtained during the scrape.
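That compare-and-overwrite step boils down to something like the sketch below, written here with pandas-gbq; the project and table names are placeholders, and the real code may handle the comparison differently.

```python
import pandas as pd
import pandas_gbq

PROJECT_ID = "my-gcp-project"        # placeholder
TABLE_ID = "blog_data.word_counts"   # placeholder dataset.table


def upload_if_changed(scraped_df: pd.DataFrame) -> None:
    """Overwrite the BigQuery table only if the freshly scraped data differs from it."""
    try:
        existing_df = pandas_gbq.read_gbq(
            f"SELECT * FROM `{TABLE_ID}`", project_id=PROJECT_ID
        )
    except pandas_gbq.exceptions.GenericGBQException:
        existing_df = pd.DataFrame()  # table doesn't exist yet, treat it as empty

    if not existing_df.empty and set(existing_df.columns) == set(scraped_df.columns):
        # Sort both frames the same way so the comparison isn't sensitive to row order
        cols = sorted(scraped_df.columns)
        old = existing_df[cols].sort_values("url").reset_index(drop=True)
        new = scraped_df[cols].sort_values("url").reset_index(drop=True)
        if old.equals(new):
            return  # nothing new has been posted, leave BigQuery alone

    pandas_gbq.to_gbq(scraped_df, TABLE_ID, project_id=PROJECT_ID, if_exists="replace")
```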
Now that there’s data available in BigQuery, the BQ_to_GA repository can finally be set up. The repo comes with a README.md file describing the steps required to deploy the Cloud Function. The prerequisites are:
- Access to a Google Cloud Platform project (with billing enabled), and an active Google Analytics account.
- The APIs for Cloud Source Repositories, Cloud Functions and Cloud Scheduler enabled in the GCP project.
- Step 1 of the Analytics API quickstart followed to enable the Analytics API in the GCP project, and the service account’s email address added to the Google Analytics account.
Once all the above has been set up, it’s simply a case of cloning the repo into a Cloud Shell session, installing the Python dependencies and configuring the config.properties file with the required information (service account key, BigQuery info, GA info, etc.).
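Purely for illustration, the kind of information that ends up in config.properties boils down to something like the following (shown here as a Python dict; the real file is a flat properties file and its actual key names are documented in the repo’s README):

```python
# Hypothetical key names, for illustration only - see the repo's README for the real ones
config = {
    "service_account_key_file": "/path/to/service-account-key.json",
    "bq_project_id": "my-gcp-project",
    "bq_dataset": "blog_data",
    "bq_table": "word_counts",
    "ga_account_id": "12345678",
    "ga_web_property_id": "UA-12345678-1",
    "ga_custom_data_source_id": "abcDEFghi_example",  # the Data Import data set ID in GA
}
```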
The repo includes a function for testing the connection to Google Analytics, which returns the GA property ID; if that succeeds, the Cloud Function can be deployed. Deployment references a Pub/Sub topic (creating it if it doesn’t exist), and Cloud Scheduler can then be used to publish to that topic, and so invoke the function, on a desired schedule.
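That test is essentially a call to the GA Management API with the configured service account. A minimal version of the idea (not the repo’s actual code) looks like this:

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

# Placeholders; in the real setup these values come from config.properties
KEY_FILE = "/path/to/service-account-key.json"
ACCOUNT_ID = "12345678"
WEB_PROPERTY_ID = "UA-12345678-1"


def test_ga_connection():
    """Fetch the configured web property via the Management API to prove the credentials work."""
    credentials = service_account.Credentials.from_service_account_file(
        KEY_FILE, scopes=["https://www.googleapis.com/auth/analytics.readonly"]
    )
    analytics = build("analytics", "v3", credentials=credentials)
    web_property = (
        analytics.management()
        .webproperties()
        .get(accountId=ACCOUNT_ID, webPropertyId=WEB_PROPERTY_ID)
        .execute()
    )
    return web_property["id"]  # the GA property ID, e.g. UA-12345678-1
```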
After a successful deployment and seeing a new file appear in GA’s Data Import, I altered the Cloud Function so that my page-crawling, word-counting code runs first. Since there are only a hundred or so blog posts on the Measurelab website, this can all be handled in one go with an execution time of around 2 minutes (Cloud Functions have a maximum timeout of 9 minutes, or 540 seconds). For larger scrapes, or code that takes longer to run, the process would need to be split out into separate steps.
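Conceptually, the altered function ends up shaped like the sketch below: a Pub/Sub-triggered entry point that runs the scrape, refreshes BigQuery if needed, and then hands over to the BigQuery-to-GA upload. The imports and function names here are placeholders rather than the repo’s real ones.

```python
# Placeholder imports: in practice these are the scraping code described above
# and the repo's BigQuery-to-GA upload step (actual module/function names will differ).
from scraper import scrape_blog_word_counts, upload_if_changed
from bq_to_ga import export_bigquery_to_ga


def main(event, context):
    """Entry point for the Pub/Sub-triggered Cloud Function.

    'event' and 'context' are supplied by the Pub/Sub trigger and aren't used here.
    """
    scraped_df = scrape_blog_word_counts()  # crawl the blog and build the dataframe
    upload_if_changed(scraped_df)           # refresh BigQuery only if the data changed
    export_bigquery_to_ga()                 # push the table into GA via Data Import
```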
So, in conclusion: each night at 23:45 some Python code is invoked, and whenever a new blog post has been published on the Measurelab website (like this one), a table in BigQuery is updated and a new file is uploaded to GA, all in a matter of minutes, without anyone needing to push any buttons.
Any comments, queries or feedback can be sent to victor@measurelab.co.uk