Simple dbt runner

This post was co-written by Alex Noonan, Dani Mermelstein & Jacob Matson.

Introduction

In light of the recent price increases in products like dbt Cloud, and the general pressure to keep costs down, we built a project template for running dbt in a production environment using GitHub Actions & AWS S3. Behold: the simple dbt runner!

The key features of this project are:

Scheduled Runs: You can set up automated dbt commands to run on a schedule, ensuring that your data modeling and transformation tasks are executed reliably and consistently.

Post-PR Merges: After merging a pull request into your project’s main branch, you have the option to trigger dbt runs. We recommend choosing either a full run or a state-aware run (which focuses only on modified models) to keep your project organized and efficient.

PR Commits Testing: To enhance your development process, dbt CI runs automatically on pull request commits. This helps you ensure that any changes you make are compatible and do not introduce unexpected issues into your data pipelines.

State Awareness: To use the state-aware workflow, you need to set up an S3 bucket to persist the manifest.json file. The same S3 bucket can also host the project documentation website, which keeps documentation creation and updates inside the development process.

Project and Environment Setup

1. Fork this repo and copy your whole dbt project into the project_goes_here folder.
2. Update your repository settings to allow GitHub Actions to create PRs. This setting can be found in a repository’s settings under Actions > General > Workflow permissions. It should look like this:

3. Go to the Actions tab and run the Project Setup workflow, selecting the type of database you want to set up. This opens a PR with our suggested changes to your profiles.yml and requirements.txt files. We assume that if you’re migrating to self-hosting you need to add a prod target to your profiles.yml file, so this action adds one for you along with the database driver you selected (see the profiles.yml sketch after this list).
4. Add the required environment variables to your GitHub Actions secrets in the Settings tab. You can see which variables are needed by looking for anything referenced with ${{ secrets. in the open PR. To take advantage of state-aware builds, you also need to define your AWS secrets: AWS_S3_BUCKET, AWS_ACCESS_KEY, and AWS_SECRET_KEY.
5. Run the Manual dbt Run workflow to confirm everything works.
6. Edit the Actions you want to keep and delete the ones you don’t.
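For reference, here is a minimal sketch of what the prod target in profiles.yml might end up looking like, using Snowflake as an example. The profile name, connection details, and environment variable names are illustrative; the Project Setup PR proposes the exact shape for the database you selected, and dbt’s env_var() function pulls credentials from the secrets you add in step 4.

```yaml
# Hypothetical profiles.yml prod target (Snowflake shown as an example).
# All names and env vars are illustrative; use the values proposed in the setup PR.
my_dbt_project:
  target: prod
  outputs:
    prod:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      user: "{{ env_var('SNOWFLAKE_USER') }}"
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      role: TRANSFORMER
      warehouse: TRANSFORMING
      database: ANALYTICS
      schema: prod
      threads: 4
```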

GitHub Actions Overview

Initially, we wanted to build the project around a boilerplate CloudFormation stack that would create the AWS resources to run a simple dbt Core runner on EC2. We pivoted to GitHub Actions for cost and simplicity. GitHub gives you 2,000 free minutes of runner time per month, which works well for personal projects or organizations with sub-scale data, and if you need to scale beyond the free minutes, the cost is reasonable. Building with GitHub Actions also makes continuous integration straightforward, allowing you to automatically build and test data transformations whenever changes are pushed to the repository.

To cover the most common use cases, we built a handful of simple actions that run dbt in production and automate key aspects of your data pipeline.

Scheduled dbt Commands: You can set up scheduled dbt commands to run at specified intervals. This automation ensures that your data transformations are consistently executed, helping you keep your data up-to-date without manual intervention.
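Under the hood, a scheduled run is just a workflow with a cron trigger. The sketch below is illustrative rather than a copy of the template’s “dbt on cron” workflow; the schedule, Python version, and secret names are assumptions.

```yaml
# Illustrative scheduled-run workflow; the template's actual workflow may differ.
name: dbt on cron
on:
  schedule:
    - cron: "0 6 * * *"   # every day at 06:00 UTC
  workflow_dispatch:       # also allow manual runs

jobs:
  scheduled-run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - name: Run dbt against the prod target
        working-directory: project_goes_here
        env:
          # Map whatever credentials your profiles.yml expects; this name is hypothetical.
          SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
        run: |
          dbt deps
          dbt build --target prod
```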

Pull Request Integration: After merging a pull request into the main branch of your repository, you can trigger dbt runs. This is a valuable feature for ensuring that your data transformations are validated and remain in a working state whenever changes are introduced. You have the flexibility to choose between a full run or a state-aware run, where only modified models are processed. This granularity allows you to balance efficiency with thorough testing.
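The post-merge workflow is the same job with a different trigger: it fires on pushes to main, which is what a merged PR produces. A minimal sketch of that trigger, with the two run options shown as comments:

```yaml
# Illustrative post-merge trigger; the job body mirrors the scheduled workflow above.
on:
  push:
    branches:
      - main
# Full run:
#   dbt build --target prod
# State-aware run (see the state-aware sketch below):
#   dbt build --select state:modified+ --state prod_state --target prod
```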

dbt CI Runs: Pull requests often involve changes to your dbt models. To maintain data integrity, dbt CI checks are performed on pull request commits. This ensures that proposed changes won’t break existing functionality or introduce errors into your data transformations. It’s a critical step in the development process that promotes data quality.
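A CI run is the same pattern triggered on pull request commits. The sketch below assumes a separate ci target exists in profiles.yml (for example, one that builds into a scratch schema); the template’s actual CI workflow may differ in its details.

```yaml
# Illustrative CI workflow triggered on PR commits.
name: dbt CI
on:
  pull_request:
    branches:
      - main

jobs:
  ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - name: Build and test against a CI target (assumed to exist in profiles.yml)
        working-directory: project_goes_here
        run: |
          dbt deps
          dbt build --target ci
```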

State-Aware Workflow: The state-aware workflow requires an S3 bucket to store the manifest.json file. This file is essential for tracking the state of your dbt models, and by persisting it in an S3 bucket, you ensure that it remains available for reference and consistency across runs. Additionally, this S3 bucket serves a dual purpose by hosting your project’s documentation website, providing easy access to documentation related to your data transformations.
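Concretely, a state-aware run pulls the previous production manifest.json out of S3 and hands it to dbt’s state selector. The steps below are a sketch: the manifest’s S3 key, the local prod_state path, and the secret names are assumptions based on the setup steps above.

```yaml
# Illustrative state-aware steps; paths and keys are assumptions.
- name: Download the previous production manifest
  env:
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_KEY }}
  run: aws s3 cp "s3://${{ secrets.AWS_S3_BUCKET }}/manifest.json" project_goes_here/prod_state/manifest.json

- name: Build only modified models and their downstream dependents
  working-directory: project_goes_here
  run: dbt build --select state:modified+ --state prod_state --target prod
```

After a successful run, the fresh target/manifest.json would be uploaded back to the same location so the next run compares against up-to-date state.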

S3 Bucket and Docs Update

Hosting your dbt docs on S3 is a simple, cost-effective way to make your documentation available. The docs are generated and pushed to S3 during the “incremental dbt on merge” and “dbt on cron” jobs: the dbt docs generate command builds the site, and the upload_to_s3.py script pushes it to S3. Adding this step to the workflow keeps the documentation current without much administrative overhead.
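In the repo that push is handled by upload_to_s3.py; purely as an illustration of the same step, an equivalent snippet using the AWS CLI preinstalled on GitHub’s Ubuntu runners could look like this (file paths and bucket layout are assumptions):

```yaml
# Illustrative docs-publishing steps; the template uses upload_to_s3.py rather than the AWS CLI.
- name: Generate dbt docs
  working-directory: project_goes_here
  run: dbt docs generate --target prod

- name: Push the static docs site to S3
  env:
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_KEY }}
  run: |
    aws s3 cp project_goes_here/target/index.html "s3://${{ secrets.AWS_S3_BUCKET }}/index.html"
    aws s3 cp project_goes_here/target/manifest.json "s3://${{ secrets.AWS_S3_BUCKET }}/manifest.json"
    aws s3 cp project_goes_here/target/catalog.json "s3://${{ secrets.AWS_S3_BUCKET }}/catalog.json"
```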

We added a CloudFormation template that creates a public-facing S3 bucket as well as an IAM user that can get and put objects in the bucket. You will need to generate AWS keys for this user and add them to your project’s environment variables for it to work. If you are unfamiliar with CloudFormation, we added some notes to the README.
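For orientation, a template along those lines looks roughly like the sketch below. Resource names and policy details are illustrative; the template shipped in the repo (and the notes in the README) are the source of truth.

```yaml
# Rough sketch of the kind of stack described above; the repo's actual template may differ.
AWSTemplateFormatVersion: "2010-09-09"
Description: Public S3 bucket for dbt docs and state, plus an IAM user for the runner.
Resources:
  DbtDocsBucket:
    Type: AWS::S3::Bucket
    Properties:
      WebsiteConfiguration:
        IndexDocument: index.html
      PublicAccessBlockConfiguration:
        BlockPublicPolicy: false
        RestrictPublicBuckets: false
  DbtDocsBucketPolicy:
    Type: AWS::S3::BucketPolicy
    Properties:
      Bucket: !Ref DbtDocsBucket
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal: "*"
            Action: s3:GetObject
            Resource: !Sub "${DbtDocsBucket.Arn}/*"
  DbtRunnerUser:
    Type: AWS::IAM::User
    Properties:
      Policies:
        - PolicyName: dbt-runner-bucket-access
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - s3:GetObject
                  - s3:PutObject
                Resource: !Sub "${DbtDocsBucket.Arn}/*"
```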