Behind the Cloud: Setting up a Dataform project within BigQuery
In this episode of Behind the Cloud, Matt demonstrates how to enable and set up a Dataform project within BigQuery, connect it to GitHub, and initialise the workspace for building a Dataform project. Matt walks us through enabling BigQuery, creating a repository, setting up the region, and using service accounts.
Watch below 👇 or head over to our YouTube channel
Transcript
Introduction to Dataform in BigQuery
[00:00:00] Matt: Hello and welcome to this week’s behind the cloud sticking with the practical theme today. We’re going to be showing you how to Enable and set up a data form project within big query connect that all up to GitHub and then initialise the initial workspace get it all committed and all synced up ready for beginning to build a data form project So without further ado, let’s jump over into GCP.
Enabling BigQuery and Accessing Dataform
[00:00:28] Matt: So first things first we’re going to need BigQuery enabled So let’s go up here and select BigQuery enable There we go. BigQuery is enabled on the left hand side within the BigQuery menu. You’ll see Dataform. So we click into Dataform. You can see right now we have no repositories. So we’re going to click create repository.
[00:00:44] Matt: Repository ID. This can be a unique name within this project. Let’s just call it a demo, set up a repo region. Important to put this in the same place where a majority of your data is going to be sitting within BigQuery. Because it’s going to make life a little bit easier. So I’m going to put this in Europe and then a service account.
[00:01:02] Matt: It’s not always advisable to use. Default service accounts for production stuff, but for initial setup and getting things going Um, it’s fine Just once you’re ready and things are ready to be pushed live go set up a service account that has only the permissions It needs to do what it needs to do so i’m just going to stick with the default for right now and click create fantastic So the repository has been created so I can just click go back to repositories and then I can see that here So now you can click into this repository.
Exploring Repository Settings
[00:01:27] Matt: You can see we’re in our repository menu. What we’ll do first is we’ll jump over to the settings menu. Here is where you can see a few different pieces of information about the setup. So you can see the name, location, service account. At the top here, you can see connect with Git. So I’m just going to click that.
Connecting Dataform to GitHub via HTTPS
[00:01:41] Matt: And then you can see there’s a couple of different ways of connecting into GitHub. We’re going to connect to it using HTTPS. Click HTTPS, then we need our Git repository URL. So it’s worth noting that you’ll want a repo set up on GitHub, because that’s what you’re going to connect to your repo within BigQuery and Dataform.
[00:01:57] Matt: So I just created an empty repo here, which just has a README file and nothing else. What we’re going to need to start connecting this up is we’re going to need the URL, which can be found within this code block here. I’m going to copy that URL and we’re going to head back over to BigQuery. So we’ll place the URL here.
[00:02:12] Matt: We’ll put in our default branch name. In our case currently the default is main, so we’ll just type in main. And here’s our secret. And the secret is a Google secret, which is our GitHub token. These github tokens are created within github and they give the token a specific set of github repositories. So there are certain set of permissions that it’s going to need in order to perform its actions.
Generating a GitHub Token
[00:02:35] Matt: Get a token, head up to the top right hand corner of github, select settings. Right down at the bottom you’ll see developer settings, personal access token, and then you can either use classic tokens or fine grain tokens. It may be best to use fine grain tokens and give all of the access that’s needed. So I’m going to click generate a new token and give your token a name.
[00:02:53] Matt: expiration and the description so you understand what it is and then specify only on selected repositories and select the repository that you wanted to link up. There’s all the different permissions within here you can find more information about those permissions that need to be granted within the documentation because I don’t have time to do it in this video.
[00:03:11] Matt: So I’ve now set everything up I need to in here and now I’m just going to generate the token. Here you’ll see your token keep this safe once you’ve copied it. You’re not going to be able to see it again. Now we head back over to Dataform where we’re going to add in a Google secret. And these are secure encrypted ways of accessing sensitive information.
Using Google Secret Manager
[00:03:31] Matt: So you can place the secret within the secret manager and then Dataform can access securely that secret without it being seen publicly. Only services and people with the correct permissions are ever going to be able to see and access those unencrypted tokens. So to do that, we’re going to need to. Go into the secret manager and enable the API.
[00:03:46] Matt: Most of the free tiers within Google are very generous, but it’s always worth keeping a note that none of this stuff is technically free and you are at some point going to pass those totals. Once we’ve done that, we’re in secret manager here. We can click create secret. Let’s call it demo secret and then we can add our secret value into the box here and leave everything else as is for now create secret.
[00:04:08] Matt: There we go. Our secret is created. You can have multiple different versions of a secret So if you were having your token run out every sort of 30 days or so You could keep coming in and adding new versions of that token now that we’ve done that let’s head back to our previous page Where we have the link, the branch name, and now we should be able to see our secret, I’ve misspelt secret, but nevermind that we’ve just added to our secret manager, which we can then select.
Adjusting IAM Permissions for Dataform
[00:04:30] Matt: So now it has everything it needs in order to connect up to that Git repository and to validate with the AXS token. If I click the link, you’ll see a problem. And that problem is that the default service account doesn’t have the necessary permissions in order to access secrets out of the box. So you’re going to need to go to IAMs and you’re going to need to add in those permissions.
Successfully Linking GitHub to Dataform
[00:04:51] Matt: So let’s quickly jump over to IAM. Let’s find our Dataform service, add a role, which will just be secret manager accessor. Now, once again, we can head back over to Dataform into our repo. Go to our settings, edit that connection, update, and now it’s successfully linked. I left that in there just to demonstrate the default service account for Dataform doesn’t have all the necessary permissions that you would think it would.
[00:05:16] Matt: The secrets is a good example, but you’ll also need to give it BigQuery permissions, um, in order for it to enact any actions within BigQuery. Now we’ve got this connected up to GitHub, we can go and create ourself a workspace. The workspace is analogous to a branch within GitHub. So, we will create a new development workspace, and this will ask us for a name.
Creating a Development Workspace
[00:05:36] Matt: Within the Dataform documentation, it recommends that you name a branch after a person. So, potentially you’d have a different workspace or branch for each analyst. In this case, I’ll put Matthew to create. And then we’ve got our workspace created. I’m going to click into this workspace. See, it’s already added the readme in because it’s found within our GitHub repository, but there’s nothing else in here right now.
Initialising the Workspace
[00:06:00] Matt: So the last thing we want to do is we want to initialise the workspace. Which will bring in and create all the template files that are for data form that we can use as a kickoff point It’ll have all the information about our project etc in our Dataform json and it’s this is where we’d be starting off.
[00:06:16] Matt: We can commit those changes to our branch. So we’ll select all of those changes. I’ll just put the initial commit. They’re ready for commit. And now we can just push those up to the branch. Then if we visit our GitHub repository, we’ll see that there’s a new branch, Matthew. And within, we’ve got all those files that were just created.
Committing and Pushing to GitHub
[00:06:32] Matt: And I can compare and pull that into the main branch. Perhaps correct the spelling of initial. I’m not going into any detail here around the GitHub stuff. Your organisation probably has its own best practices and ways of doing things within Git. Let’s merge that, confirm, and then go back to our code, you’ll see within main we have all of these files that are from that workspace, if we wish we can now sync back up with that main branch, which won’t make a difference to ours, and that’s it, so we have created a new repository, we have created a new workspace, I’ve hooked our data form up to GitHub and we have initialised that initial workspace and sync that between data form and GitHub.
[00:07:11] Matt: And that’s it for this week’s Behind the Cloud. As always, if you want to keep up to date with any new videos that we release, please like, and subscribe to be notified when those go up. We’ve got a long list of ideas and new things we can be exploring in Dataform and all of the GCP, so keep in touch.