Behind the Cloud – Releases and scheduling in Dataform

In this episode of Behind the Cloud, Matt dives into the details of releases and scheduling in Dataform. He breaks down how to manage different versions of your codebase in GitHub, from taking snapshots to scheduling executions at various intervals: daily, hourly, or monthly.

By the end of the episode, you’ll have the know-how to confidently release and schedule your code, making it easier to build robust tables and models with Dataform.

Watch below 👇 on YouTube, or catch more from the series here.

Transcript

Introduction to releases and scheduling

[00:00:00] Hello and welcome to another episode of Behind the Cloud. Today we’re going to be looking at Dataform again, and specifically this time we’re going to be exploring releases and scheduling. Releases and scheduling are where you take versions of your codebase that you will have sitting up in GitHub, in your production or your development branch. You take snapshots and schedule those to be run on a daily, hourly or monthly basis, or however often you want to be scheduling your code.

We’ll also talk about how you can run different versions of your code, like a development and a production version, and schedule backfills and things like that. So hopefully by the end of this you’ll find it really simple to go away and release and schedule versions of your code, and then start building out tables and models with Dataform.

Navigating the Dataform repository

[00:00:44] So to look at releases and scheduling, you can just go to your main repository within Dataform, and then you’ll see the same four menus we always see: development workspaces, where we have our development workspaces; workflow executions, which shows the executions that have run; and then releases and scheduling, which we’ve yet to look at.

Understanding release and workflow configurations

[00:01:02] Within releases and scheduling you’ve got two main menus in Dataform: Release Configurations and Workflow Configurations. Release Configurations can be thought of as taking a snapshot of the code base from a specific branch within GitHub, whereas Workflow Configurations are more like scheduling those snapshots of code from your GitHub repositories. So we’ve got a production one already made here. This will, on an hourly basis, go to the main branch within our GitHub repository and get a copy, or a snapshot, of that code base. And then once a day at 7am, the production workflow configuration runs that code base. It’s just a way of being able to move code around, push things up, and make sure you’ve got the latest version of the code available to be run against.
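The production setup described here can be written down as two small configuration objects. The following is a minimal sketch rather than the setup shown in the video: the field names are assumed to mirror the ReleaseConfig and WorkflowConfig resources of the Dataform REST API and should be checked against the reference, and the project, location, repository and time zone values are hypothetical placeholders.

```python
# Hypothetical repository path; replace with your own project, region and repo.
REPOSITORY = "projects/my-project/locations/europe-west2/repositories/my-repo"

# Release configuration: a snapshot (compilation) of one branch.
# Field names are assumed to match the Dataform REST API; verify before use.
production_release = {
    "name": f"{REPOSITORY}/releaseConfigs/production",
    "gitCommitish": "main",       # branch to take the snapshot from
    "cronSchedule": "0 * * * *",  # new snapshot at minute 0 of every hour
}

# Workflow configuration: a schedule that runs the latest snapshot.
production_workflow = {
    "name": f"{REPOSITORY}/workflowConfigs/production",
    "releaseConfig": production_release["name"],  # which snapshot to run
    "cronSchedule": "0 7 * * *",                  # run once a day at 07:00
    "timeZone": "Europe/London",                  # assumed time zone
}
```

The two schedules are independent: the release configuration decides how fresh the snapshot is, and the workflow configuration decides when that snapshot is executed.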

[00:01:49] Some of this language and some of the way they describe these things like release configurations and workflow configurations are a bit hard to grasp and a bit opaque. I think I’ve done a bit of a better job of explaining exactly what these things are. In the interest of trying to explain it a little bit more, let’s go through setting one up.

Creating a release configuration

[00:02:05] So, release configuration, let’s create one. First and foremost, what’s this release ID going to be called? We’ve already got a production, so let’s call this one dev. So this is going to be a snapshot of our dev code base from GitHub. Git commitish is essentially the branch within GitHub. So again, dev.

[00:02:21] You could obviously point this at your main branch. Maybe you’ve got a branch off of dev where you’ve been working on various features, so you could set up a trigger on that separate branch as well. The schedule frequency is how often a snapshot of that codebase is taken.

[00:02:37] Like I said, you can do this on an hourly, daily, weekly or monthly basis, or on a custom frequency if you wish. You could also just run it on demand, and maybe that’s better here because we’re setting up a dev run. We don’t necessarily need to be taking snapshots of this every hour, so whenever we need to run dev and have a new set of test data pulled out of our model, we can just come in and run this on demand.

Compilation overrides and schema suffixes

[00:03:06] Compilation overrides are exactly like the ones in your main settings. These just override things like the cloud project, the schema suffix or the table prefix, to differentiate them from other tables and schemas that you may be running on a production basis. So for example, you may have a dataform dataset that all your production data is running into.

[00:03:26] And for your dev purposes, you might just have a schema suffix of dev, so that anything you run from the dev branch will go into dataform_dev rather than messing around with, or potentially overwriting, your production data, which would obviously be a big problem. We want to keep a firewall between production and development wherever possible.

[00:03:44] Equally, some people will have an entire Google Cloud project dedicated to development, and an entire Google Cloud project to production. So you could also go that way around and separate those things out in that way. 

Managing multiple branches and analysts

You may also have multiple people within the same organisation working in the same repo: multiple analysts who all have their own branches. So you can have a Matthew branch, or a Clive branch, in Dataform. And we could have our workspace names prefixed to the tables, so if, say, we were working on the same table, any tables we output would carry a prefix of our names to keep them separate and stop us stepping on each other’s toes. I think for our instance, just dev is going to be fine.
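To make the effect of these overrides concrete, here is a minimal sketch of how a schema suffix and a table prefix change where a table lands. It assumes the suffix and prefix are joined with an underscore, as in the dataform_dev example; the project name and the orders table are hypothetical.

```python
from typing import Optional


def resolve_target(
    project: str,
    schema: str,
    table: str,
    schema_suffix: Optional[str] = None,
    table_prefix: Optional[str] = None,
) -> str:
    """Return the fully qualified table name after applying the overrides."""
    if schema_suffix:
        schema = f"{schema}_{schema_suffix}"  # dataform -> dataform_dev
    if table_prefix:
        table = f"{table_prefix}_{table}"     # orders -> matthew_orders
    return f"{project}.{schema}.{table}"


# Same model, three different destinations (all names are illustrative).
prod = resolve_target("my-project", "dataform", "orders")
dev = resolve_target("my-project", "dataform", "orders", schema_suffix="dev")
matthew = resolve_target("my-project", "dataform", "orders", table_prefix="matthew")
```

With no overrides the function returns the production location, which is exactly the case the suffix is there to prevent for a dev run.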

[00:04:26] It’s really important that you pay attention here and put a suffix in if it’s needed, because if you don’t and you run this, it could inadvertently overwrite your production data, which would obviously not be brilliant. We also have these compilation variables. Like I said, this is getting a snapshot of code.

Using compilation variables

[00:04:43] And what you can do at the time of creating that snapshot is insert a number of variables that are then used within the Dataform model to manipulate runs, or to have specific variables that are used within that model. So one example could be that you have a series of start dates. We could say we’re going to start six days ago, as an example.

[00:05:10] Within our workflow_settings.yaml, we’d have a var. Let’s go and just quickly show you this.

[00:05:14] So just as an example, I’ve jumped into a test repo. Let’s just go into this branch you see here. We’ve got a workflow_settings.yaml. And inside that workflow_settings.yaml we’ve got these variables, where we can have whatever we want. And we can use these variables throughout the rest of our code base. So we could have a function somewhere that’s converting seven days to today minus seven days, and creating a date out of it. And that date could be used as a filter on various tables. So you get the idea: you could have as many of these different variables in here as you wish.
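The "seven days to today minus seven days" conversion can be sketched as below. In a Dataform project a helper like this would normally be written in JavaScript; this Python version only illustrates the arithmetic. The function name is hypothetical, and the variable is assumed to arrive as a string.

```python
from datetime import date, timedelta
from typing import Optional


def start_date(days_back: str, today: Optional[date] = None) -> str:
    """Turn a 'days back' variable into an ISO date string usable in a filter."""
    today = today or date.today()
    # The variable is assumed to be a string, so convert it before subtracting.
    return (today - timedelta(days=int(days_back))).isoformat()


# A fixed "today" keeps the example reproducible.
example = start_date("7", today=date(2024, 5, 15))
print(example)  # 2024-05-08
```

The resulting string could then be dropped into a WHERE clause on any table that needs the same date window.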

Backfilling

[00:05:46] So just to jump back over to our branch, what you can do at the time of running this is you can have specific compilation variables, so I could have a start date. And maybe in my default workflow_settings.yaml it’s 7 days, but I want to run a big old backfill, so I’m going to change this to 30.

[00:06:03] Now when it takes a snapshot of this code, it’s going to override any of those variables that match this key within our workflow_settings.yaml. And that snapshot will have our 30 day variable in it instead of 7. And thus, anywhere in the code that uses it will be affected. So you can create multiple versions of the same code base by passing in different variables.
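The override rule described here, where a compilation variable replaces the default with the matching key and everything else is left alone, behaves like a simple dictionary merge. This is a minimal sketch of that rule only; the variable names start_days and country are hypothetical.

```python
def compile_vars(defaults: dict, overrides: dict) -> dict:
    """Overrides win for matching keys; defaults without a match are kept."""
    return {**defaults, **overrides}


# Defaults as they might appear under the variables in workflow_settings.yaml.
workflow_settings_vars = {"start_days": "7", "country": "UK"}

# Compilation variables set on a release configuration used for a backfill.
backfill_vars = compile_vars(workflow_settings_vars, {"start_days": "30"})
print(backfill_vars)  # {'start_days': '30', 'country': 'UK'}
```

The defaults themselves are untouched, so a regular release configuration without the override still compiles with 7 days.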

Taking snapshots

[00:06:25] Equally, you could potentially be passing in table names and have very uniform models that you can apply in multiple different places and pass in different table names via variables. So you could just take different snapshots of the same code base with a different table name and fundamentally it outputs different data because it’s coming off different tables.
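As a rough illustration of the same model being pointed at different tables, the sketch below builds one query from a table name passed in as a variable. The variable name source_table, the column names and the table names are all hypothetical.

```python
def build_query(variables: dict) -> str:
    """Build the same query against whichever table the snapshot was given."""
    source = variables["source_table"]
    return f"SELECT order_id, amount FROM `{source}` WHERE amount > 0"


# Two snapshots of the same code base, differing only in one variable.
uk_query = build_query({"source_table": "my-project.sales.orders_uk"})
us_query = build_query({"source_table": "my-project.sales.orders_us"})
```

The logic is identical in both cases; only the source table, and therefore the output data, differs.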

[00:06:41] This can be quite powerful once you start to think about the ways in which you could use it. This is essentially a way of us getting a snapshot of our code within our development branch, and passing in any dynamic variables that we want to affect our model. So if I came in here I could grab a new compilation. I could grab a new snapshot of the code base right there and then, which you would see displayed down here.

Running executions

And that would allow me to then start an execution because I’ve just gone and got a fresh copy of it. Equally, you can see production is running on an hour by hour basis. So we can see that each hour it’s grabbing a new version of that code base. So we can be pretty confident that we’ve got the up to date code base. And just run our pipeline if we needed to on demand, or we can get a new compilation.

[00:07:27] Like so, and we’ll see here that the manual update of the compilation is sitting at the bottom there. We can see all the settings for our production run here, so we’ve not got any compilation overrides etc. So that’s a snapshot of the code, but equally we can also set up a workflow configuration which, like I said, is a schedule. You create a configuration ID, so let’s just call this one production manual.

Setting up workflow configurations in Dataform

[00:07:52] And you select a release configuration, which is essentially a snapshot of the code that we’ve just created above. So in this case, let’s just stick with production. As long as your service account has access to all the things it needs to have access to, you can leave this blank. It will use the service account that’s attached to your Dataform project.

[00:08:08] Let’s make this one on demand. And then this is where you can select specific actions, specific tags or all actions. Where this can get really useful is maybe you’ve got multiple dashboards, and you want to separate out the running of all those dashboards. You could do that via tags. So you could have multiple different workflow configurations that are running different dashboards from a release configuration. Or alternatively you could have daily, weekly and monthly tags for different tables and models that need to run at different frequencies. In which case you could select a daily tag from your tags, and then that would select all of the tables and models that are marked up as daily.

Using tags for scheduling

[00:08:49] And then you could do the same for weekly. And then you could schedule them on that basis. So anything that’s tagged as daily, I can run on a daily basis. Anything that’s tagged as weekly, I can run on a weekly basis. So this is just a way of beginning to pull apart your overall repo so you can schedule things separately, which means that everything doesn’t have to run all at once.

[00:09:11] It can run based on how often you want it to. You can also specify selected actions and run selected actions and specific tables. That’s if you only want to run these things and not necessarily tag them. And you can decide if that is going to just run that one action, or if it will also run anything upstream or anything downstream, by using include dependencies or include dependents.
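Tag-based selection can be pictured as a simple filter over the actions in the repository. The sketch below is illustrative only: the action names and tags are hypothetical, and an action is assumed to be selected if it carries at least one of the chosen tags.

```python
from typing import Dict, List

# Hypothetical actions (tables and models) with the tags they carry.
ACTIONS: List[Dict] = [
    {"name": "daily_sales", "tags": ["daily", "sales_dashboard"]},
    {"name": "daily_traffic", "tags": ["daily"]},
    {"name": "weekly_summary", "tags": ["weekly", "sales_dashboard"]},
    {"name": "monthly_finance", "tags": ["monthly"]},
]


def select_by_tags(actions: List[Dict], included_tags: List[str]) -> List[str]:
    """Return names of actions carrying at least one of the included tags."""
    wanted = set(included_tags)
    return [a["name"] for a in actions if wanted & set(a["tags"])]


daily_run = select_by_tags(ACTIONS, ["daily"])
weekly_run = select_by_tags(ACTIONS, ["weekly"])
dashboard_run = select_by_tags(ACTIONS, ["sales_dashboard"])
```

Each result would correspond to one workflow configuration with its own schedule, so the daily tables are not rebuilt by the weekly run and vice versa.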
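The difference between including dependencies (everything upstream) and dependents (everything downstream) can be sketched as two walks over the dependency graph. This is a simplified model, not Dataform’s own implementation, and the four action names are hypothetical.

```python
from typing import Dict, List, Set

# action -> the actions it depends on (its upstream dependencies).
DEPENDS_ON: Dict[str, List[str]] = {
    "raw_orders": [],
    "clean_orders": ["raw_orders"],
    "daily_sales": ["clean_orders"],
    "sales_dashboard": ["daily_sales"],
}


def upstream(action: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Everything the action depends on, directly or indirectly."""
    seen: Set[str] = set()
    stack = list(graph[action])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return seen


def downstream(action: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Everything that depends on the action, directly or indirectly."""
    reverse: Dict[str, List[str]] = {name: [] for name in graph}
    for name, deps in graph.items():
        for dep in deps:
            reverse[dep].append(name)
    return upstream(action, reverse)  # same walk over the reversed edges


def select(action, graph, include_dependencies=False, include_dependents=False):
    """Return the set of actions that would run for the given options."""
    selected = {action}
    if include_dependencies:
        selected |= upstream(action, graph)
    if include_dependents:
        selected |= downstream(action, graph)
    return selected
```

For example, `select("clean_orders", DEPENDS_ON, include_dependents=True)` picks up daily_sales and sales_dashboard as well, while leaving raw_orders alone.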

Full refresh and table rebuilds

[00:09:36] And then, finally, you can do a full refresh of the table. If you’ve made changes to the schema, for example, or anything significant to the table, and you need to drop that table and recreate it, you can do that. Click full refresh and, when this is enabled, it will rebuild that table from scratch.
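Pulling the options from this walkthrough together, an on-demand workflow configuration with a tag selection and full refresh might look roughly like this. It is a sketch under assumptions: the field names are assumed to mirror the WorkflowConfig and InvocationConfig resources of the Dataform REST API and should be verified, the repository path is a placeholder, and the ID is written as production-manual on the assumption that IDs cannot contain spaces.

```python
# Hypothetical repository path; replace with your own project, region and repo.
REPOSITORY = "projects/my-project/locations/europe-west2/repositories/my-repo"

production_manual = {
    "name": f"{REPOSITORY}/workflowConfigs/production-manual",
    # Which snapshot to run: the existing production release configuration.
    "releaseConfig": f"{REPOSITORY}/releaseConfigs/production",
    # No "cronSchedule" key: assumed to mean the workflow only runs on demand.
    # No "serviceAccount" key: the default service account is used.
    "invocationConfig": {
        "includedTags": ["daily"],                     # run only 'daily' actions
        "transitiveDependenciesIncluded": True,        # include dependencies
        "transitiveDependentsIncluded": False,         # do not include dependents
        "fullyRefreshIncrementalTablesEnabled": True,  # full refresh
    },
}
```

Keeping the full refresh in its own on-demand configuration means the regular scheduled run stays incremental, and the rebuild only happens when someone deliberately triggers it.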

Summary

[00:09:53] So it may be that you could just have a workflow configuration sitting here that is a full refresh, just in case you ever do any large updates and changes. So in a nutshell, that is releases and scheduling in Dataform. You create a release configuration that is a snapshot of your code base, passing in any dynamic variables that you want to pass in. You update that either manually or on a regular basis, like hourly, weekly or monthly. You then create a workflow configuration which specifies the tags or actions that you want to run, and the cadence and timeframes on which you want to run them.

[00:10:27] And you select whether you want to fully refresh things, and whether you want to have dependents or dependencies included. And that is a nice, easy way of breaking things apart, scheduling things and getting multiple different branches and actions of your Dataform model running.

[00:10:40] Thanks for watching. Hopefully you enjoyed that video. As always, please like and subscribe to be notified whenever we release new videos. If you’ve got any suggestions of anything that you’d like us to cover, please give us a shout. And we’ll be sure to try and get that in the backlog. Thank you very much.
