Modern Data Stack in a Box with DuckDB

TLDR: A fast, free, and open-source Modern Data Stack (MDS) can now be fully deployed on your laptop or to a single machine using the combination of DuckDBMeltanodbt, and Apache Superset.

This post is a collaboration with Jacob Matson and cross-posted on DuckDB.org.

Summary

There is a large volume of literature (123) about scaling data pipelines. “Use Kafka! Build a lake house! Don’t build a lake house, use Snowflake! Don’t use Snowflake, use XYZ!” However, with advances in hardware and the rapid maturation of data software, there is a simpler approach. This article will light up the path to highly performant single node analytics with an MDS-in-a-box open source stack: Meltano, DuckDB, dbt, & Apache Superset on Windows using Windows Subsystem for Linux (WSL). There are many options within the MDS, so if you are using another stack to build an MDS-in-a-box, please share it with the community on the DuckDB TwitterGitHub, or Discord, or the dbt slack! Or just stop by for a friendly debate about our choice of tools!

Motivation

What is the Modern Data Stack, and why use it? The MDS can mean many things (see examples here and a historical perspective here), but fundamentally it is a return to using SQL for data transformations by combining multiple best-in-class software tools to form a stack. A typical stack would include (at least!) a tool to extract data from sources and load it into a data warehouse, dbt to transform and analyze that data in the warehouse, and a business intelligence tool. The MDS leverages the accessibility of SQL in combination with software development best practices like git to enable analysts to scale their impact across their companies.

Why build a bundled Modern Data Stack on a single machine, rather than on multiple machines and on a data warehouse? There are many advantages!

  • Simplify for higher developer productivity
  • Reduce costs by removing the data warehouse
  • Deploy with ease either locally, on-premise, in the cloud, or all 3
  • Eliminate software expenses with a fully free and open-source stack
  • Maintain high performance with modern software like DuckDB and increasingly powerful single-node compute instances
  • Achieve self-sufficiency by completing an end-to-end proof of concept on your laptop
  • Enable development best practices by integrating with GitHub
  • Enhance security by (optionally) running entirely locally or on-premise

If you contribute to an open-source community or provide a product within the Modern Data Stack, there is an additional benefit!

Trade-offs

One key component of the MDS is the unlimited scalability of compute. How does that align with the MDS-in-a-box approach? Today, cloud computing instances can vertically scale significantly more than in the past (for example, 224 cores and 24 TB of RAM on AWS!). Laptops are more powerful than ever. Now that new OLAP tools like DuckDB can take better advantage of that compute, horizontal scaling is no longer necessary for many analyses! Also, this MDS-in-a-box can be duplicated with ease to as many boxes as needed if partitioned by data subject area. So, while infinite compute is sacrificed, significant scale is still easily achievable.

Due to this tradeoff, this approach is more of an “Open Source Analytics Stack in a box” than a traditional MDS. It sacrifices infinite scale for significant simplification and the other benefits above.

Choosing a problem

Given that the NBA season is starting soon, a monte carlo type simulation of the season is both topical and well-suited for analytical SQL. This is a particularly great scenario to test the limits of DuckDB because it only requires simple inputs and easily scales out to massive numbers of records. This entire project is held in a GitHub repo, which you can find here: https://www.github.com/matsonj/nba-monte-carlo.

Building the environment

The detailed steps to build the project can be found in the repo, but the high-level steps will be repeated here. As a note, Windows Subsystem for Linux (WSL) was chosen to support Apache Superset, but the other components of this stack can run directly on any operating system. Thankfully, using Linux on Windows has become very straightforward.

  1. Install Ubuntu 20.04 on WSL.
  2. Upgrade your packages (sudo apt update).
  3. Install python.
  4. Clone the git repo.
  5. Run make build and then make run in the terminal.
  6. Create super admin user for Superset in the terminal, then login and configure the database.
  7. Run test queries in superset to check your work.

Meltano as a wrapper for pipeline plugins

In this example, Meltano pulls together multiple bits and pieces to allow the pipeline to be run with a single statement. The first part is the tap (extractor) which is ‘tap-spreadsheets-anywhere‘. This tap allows us to get flat data files from various sources. It should be noted that DuckDB can consume directly from flat files (locally and over the network), or SQLite and PostgreSQL databases. However, this tap was chosen to provide a clear example of getting static data into your database that can easily be configured in the meltano.yml file. Meltano also becomes more beneficial as the complexity of your data sources increases.

plugins:
  extractors:
  - name: tap-spreadsheets-anywhere
    variant: ets
    pip_url: git+https://github.com/ets/tap-spreadsheets-anywhere.git
# data sources are configured inside of this extractor

The next bit is the target (loader), ‘target-duckdb‘. This target can take data from any Meltano tap and load it into DuckDB. Part of the beauty of this approach is that you don’t have to mess with all the extra complexity that comes with a typical database. DuckDB can be dropped in and is ready to go with zero configuration or ongoing maintenance. Furthermore, because the components and the data are co-located, networking is not a consideration and further reduces complexity.

  loaders:
  - name: target-duckdb
    variant: jwills
    pip_url: target-duckdb~=0.4
    config:
      filepath: /tmp/mdsbox.db
      default_target_schema: main

Next is the transformer: ‘dbt-duckdb‘. dbt enables transformations using a combination of SQL and Jinja templating for approachable SQL-based analytics engineering. The dbt adapter for DuckDB now supports parallel execution across threads, which makes the MDS-in-a-box run even faster. Since the bulk of the work is happening inside of dbt, this portion will be described in detail later in the post.

  transformers:
  - name: dbt-duckdb
    variant: jwills
    pip_url: dbt-core~=1.2.0 dbt-duckdb~=1.2.0
    config:
      path: /tmp/mdsbox.db

Lastly, Apache Superset is included as a Meltano utility to enable some data querying and visualization. Superset leverages DuckDB’s SQLAlchemy driver, duckdb_engine, so it can query DuckDB directly as well.

  utilities:
  - name: superset
    variant: apache
    pip_url: apache-superset==1.5.0 markupsafe==2.0.1 duckdb-engine==0.6.4

With Superset, the engine needs to be configured to open DuckDB in “read-only” mode. Otherwise, only one query can run at a time (simultaneous queries will cause locks). This also prevents refreshing the Superset dashboard while the pipeline is running. In this case, the pipeline runs in under 8 seconds!

Wrangling the data

The NBA schedule was downloaded from basketball-reference.com, and the Draft Kings win totals from Sept 27th were used for win totals. The schedule and win totals make up the entirety of the data required as inputs for this project. Once converted into CSV format, they were uploaded to the GitHub project, and the meltano.yml file was updated to reference the file locations.

Loading sources

Once the data is on the web inside of GitHub, Meltano can pull a copy down into DuckDB. With the command meltano run tap-spreadsheets-anywhere target-duckdb, the data is loaded into DuckDB, and ready for transformation inside of dbt.

Building dbt models

After the sources are loaded, the data is transformed with dbt. First, the source models are created as well as the scenario generator. Then the random numbers for that simulation run are generated – it should be noted that the random numbers are recorded as a table, not a view, in order to allow subsequent re-runs of the downstream models with the graph operators for troubleshooting purposes (i.e. dbt run -s random_num_gen+). Once the underlying data is laid out, the simulation begins, first by simulating the regular season, then the play-in games, and lastly the playoffs. Since each round of games has a dependency on the previous round, parallelization is limited in this model, which is reflected in the dbt DAG, in this case conveniently hosted on GitHub Pages.

There are a few more design choices worth calling out:

  1. Simulation tables and summary tables were split into separate models for ease of use / transparency. So each round of the simulation has a sim model and an end model – this allows visibility into the correct parameters (conference, team, elo rating) to be passed into each subsequent round.
  2. To prevent overly deep queries, ‘reg_season_end’ and ‘playoff_sim_r1’ have been materialized as tables. While it is slightly slower on build, the performance gains when querying summary tables (i.e. ‘season_summary’) are more than worth the slowdown. However, it should be noted that even for only 10k sims, the database takes up about 150MB in disk space. Running at 100k simulations easily expands it to a few GB.

Connecting Superset

Once the dbt models are built, the data visualization can begin. An admin user must be created in superset in order to log in. The instructions for connecting the database can be found in the GitHub project, as well as a note on how to connect it in ‘read only mode’.

There are 2 models designed for analysis, although any number of them can be used. ‘season_summary’ contains various summary statistics for the season, and ‘reg_season_sim’ contains all simulated game results. This second data set produces an interesting histogram chart. In order to build data visualizations in superset, the dataset must be defined first, the chart built, and lastly, the chart assigned to a dashboard.

Below is an example Superset dashboard containing several charts based on this data. Superset is able to clearly summarize the data as well as display the level of variability within the monte carlo simulation. The duckdb_engine queries can be refreshed quickly when new simulations are run.

season summary & expected wins
playoff results

Conclusions

The ecosystem around DuckDB has grown such that it integrates well with the Modern Data Stack. The MDS-in-a-box is a viable approach for smaller data projects, and would work especially well for read-heavy analytics. There were a few other learnings from this experiment. Superset dashboards are easy to construct, but they are not scriptable and must be built in the GUI (the paid hosted version, Preset, does support exporting as YAML). Also, while you can do monte carlo analysis in SQL, it may be easier to do in another language. However, this shows how far you can stretch the capabilities of SQL!

Next steps

There are additional directions to take this project. One next step could be to Dockerize this workflow for even easier deployments. If you want to put together a Docker example, please reach out! Another adjustment to the approach could be to land the final outputs in parquet files, and to read them with in-memory DuckDB connections. Those files could even be landed in an S3-compatible object store (and still read by DuckDB), although that adds complexity compared with the in-a-box approach! Additional MDS components could also be integrated for data quality monitoring, lineage tracking, etc.

Josh Wills is also in the process of making an interesting enhancement to dbt-duckdb! Using the sqlglot library, dbt-duckdb would be able to automatically transpile dbt models written using the SQL dialect of other databases (including Snowflake and BigQuery) to DuckDB. Imagine if you could test out your queries locally before pushing to production… Join the DuckDB channel of the dbt slack to discuss the possibilities!

Please reach out if you use this or another approach to build an MDS-in-a-box! Also, if you are interested in writing a guest post for the DuckDB blog, please reach out on Discord!

Three steps to handling sharded databases with dbt

A common pattern in scaling production app databases is to keep them as small as possible. Since building production apps is not my forte, I’ll lean on the commentary of experts. I like how Silvia Botros, author of High Performance MySQL, frames it below:

just keep sharding, just keep sharding…

This architecture presents a unique challenge for analytics engineering because you now have many databases with identical schemas, and dbt sources must be enumerated in your YAML files.

I am going to share the three steps that I use to solve this problem. It should be noted that if you are comfortable with jinja, I am sure there are better, more pythonic ways to solves this problem. I have landed on this solution as something that is easy to understand, fast to develop, and fast to run (i.e. performant).

Step 1: leverage YAML anchors and aliases

Anchors and Aliases are YAML constructions that allow you to reduce repeat syntax and extend existing data nodes. You can place Anchors (&) on an entity to mark a multi-line section. You can then use an Alias (*) call that anchor later in the document to reference that section.

https://www.educative.io/blog/advanced-yaml-syntax-cheatsheet

By using anchors and aliases, we can drastically cut down on the amount of duplicate code that we need to write in our YAML file. A simplified version of what I have is below.

  - name: BASE_DATABASE
    database: CUSTOMER_N
    schema: DATA
    tables: &SHARD_DATA
      - name: table_one
        identifier: name_that_makes_sense_to_eng_but_not_data
        description: a concise description
      - name: table_two

  - name: CUSTOMER_DATABASE
    database: CUSTOMER_N+1
    schema: DATA
    tables: *SHARD_DATA

Unfortunately with this solution, every time a new shard is added, we have to add a new line to our YAML file. While I don’t have a solution off hand, I am certain that you could generate this file with Python.

Step 2: Persist a list of your sharded databases

This next steps seems pretty obvious, but you need a list of your shards. There are multiple ways to get this data, but I will share two of them. The first is getting the list directly from your information schema.

(SQL SERVER)
SELECT * FROM sys.databases;

(SNOWFLAKE)
SELECT * FROM information_schema.databases

You can then persist that information in a dbt model that you can query later.

The second way is to create a dbt seed. Since I already have a manual intervention in step 1, I am ok with a little bit of extra work in managing a seed as well. This also gives me the benefit of source control so I can tell when additional shards came online. And of course, this gives a little finer control over what goes into your analytics area since you may have databases that you don’t want to include in the next step. An example seed is below.

Id,SourceName
1,BASE_DATABASE
2,CUSTOMER_DATABASE

Step 3: Use jinja + dbt_utils.get_column_values to procedurally generate your SQL

The of magic enabled by dbt here is that you can put a for loop inside your SQL query. This means that instead of writing out hundreds or thousands of lines of code to load your data into one place, dbt will instead generate it. Make sure that you have dbt_utils in your packages.yml file and that you have run ‘dbt deps’ to install it first.

{% set source_names = dbt_utils.get_column_values(table=ref('seed'), column='SourceName') %}
{% for sn in source_names %}
  SELECT field_list,
    '{{ sn }}' AS source_name
  FROM {{ source( sn , 'table_one' ) }} one
    INNER JOIN {{ ref( 'table_two' ) }} two ON one.id = two.id
  {% if not loop.last %} UNION ALL {% endif %}
{% endfor %}

In the case of our example, since we have two records in our ‘seed’ table, this will create two SQL queries with a UNION between them. Perfect!

Now I have scaled this to 25 databases or so, so managing it by hand works fine for me. Obviously if you have thousands of databases in production in this paradigm, running a giant UNION ALL may not be feasible (also I doubt you are reading this article if you have that many databases in prod). In fact, I ran into some internal constraints with parallelization with UNION with some models, so I use pre and post-hooks to handle it in a more scalable manner for those. Again, context matters here, so depending on the shape of your data, this may not work for you. Annoyingly, this doesn’t populate the dbt docs with anything particularly meaningful so you will need to keep that in mind.

(SQL SERVER)

{{ config(
    materialized = "table",
    pre_hook="
      DROP TABLE IF EXISTS #source;
      CREATE TABLE #source
      (
        some_field INT
      );

      {% set source_names = dbt_utils.get_column_values(table=ref('seed'), column='SourceName') %}
      {% for sn in source_names %}
        SELECT field_list,
          '{{ sn }}' AS source_name
        FROM {{ source( sn , 'table_one' ) }} one
          INNER JOIN {{ ref( 'table_two' ) }} two ON one.id = two.id
       {% endfor %} 
       DROP TABLE IF EXISTS target;
       SELECT * INTO target FROM #source",
    post_hook="
      DROP TABLE #source;
      DROP TABLE target;"
  )
}}    

SELECT * FROM target

So there you have it, a few ways to pull multiple tables into one with dbt. Hope you found this helpful!

Alternative methods: using dbt_utils.union_relations

In theory, using dbt_utils.union_relations can also accomplish the same as step 3, but I have not tested it that way.

Optimizing SQL queries for speed with dbt

Like most people, I’ve been obsessed with Wordle for the past few weeks. It’s been a fun diversion and the perfect thing to do while sipping a cup of coffee.

But of course, my brain is somewhat broken by SQL and when I saw this GitHub repo courtesy of Derek Visch, I was intrigued by the idea of using SQL to build a Wordle optimizer.

Using his existing queries, I was able to get a list of “optimal” first words. But it took forever! On my laptop, over 900 seconds. Surely this thing could be optimized.

the first dbt run of the query

For reference, you can find the query here, but I’ve pulled a point in time copy below.

{{ config( tags=["old"] ) }}

WITH guesses as (
      SELECT 
            word,
            SUBSTRING(word, 1, 1) letter_one,
            SUBSTRING(word, 2, 1) letter_two,
            SUBSTRING(word, 3, 1) letter_three,
            SUBSTRING(word, 4, 1) letter_four,
            SUBSTRING(word, 5, 1) letter_five
      FROM {{ ref( 'wordle' ) }} ), 
answers as (
      select
            word,
            SUBSTRING(word, 1, 1) letter_one,
            SUBSTRING(word, 2, 1) letter_two,
            SUBSTRING(word, 3, 1) letter_three,
            SUBSTRING(word, 4, 1) letter_four,
            SUBSTRING(word, 5, 1) letter_five
      from {{ ref( 'answer' ) }} ), 
crossjoin as (
      select
            guesses.word as guess,
            answers.word as answer,
            CASE 
                  WHEN answers.letter_one in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five)   THEN 1
                  ELSE 0
            end as a1_match,
            CASE 
                  WHEN answers.letter_two in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five)   THEN 1
                  ELSE 0
            end as a2_match,
            CASE 
                  WHEN answers.letter_three in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five)   THEN 1
                  ELSE 0
            end as a3_match,
            CASE 
                  WHEN answers.letter_four in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five)   THEN 1
                  ELSE 0
            end as a4_match,
            CASE 
                  WHEN answers.letter_five in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five)   THEN 1
                  ELSE 0
            end as a5_match
      from guesses
      cross join answers), 
count_answers as (
      select 
            guess,
            answer,
            a1_match + a2_match + a3_match + a4_match + a5_match as total
      from crossjoin), 
maths_agg as (
      select
            guess,
            sum(total),
            avg(total) avg,
            stddev(total),
            max(total),
            min(total)
      from count_answers
      group by guess
      order by avg desc ), 
final as (
      select * 
      from maths_agg )

select * 
from final

The first optimization

The first, most obvious lever to pull on was to increase compute! So I switched to my newly built gaming PC. The environment setup is win 11 pro , dbt 1.0.0, and postgres 14 (via WSL2), running on an AMD 5600G processor with 32GB of RAM, although WSL2 only has access to 8GB of RAM. I will detail the environment setup in another post.

With this increased compute, I was able to reduce run time by 3.4x, from 927s to 272s.

getting faster

The second optimization

The next level was inspecting the query itself and understand where potential bottlenecks could be. There are a couple ways to do this, one of which is using the query planner. In this case, I didn’t do that because I don’t know how to use the postgresql query planner – mostly I’ve used SQL Server so I’m a bit out of my element here.

So I took each CTE apart and made them into views & tables depending complexity. Simple queries that are light on math can be materialized as views, where as more complex, math intensive queries can be materialized as tables. I leveraged the dbt config block in the specific queries I wanted to materialize as tables.

one query, now multiple models + 1 DAG

Simply by strategically using the table materialization, we can increase performance by 9.0x – 272s to 30s.

much better

The third optimization

Visually inspecting the query further, the crossjoin model is particularly nasty as a CTE.

crossjoin as (
      select
            guesses.word as guess,
            answers.word as answer,
            CASE 
                  WHEN answers.letter_one in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five)   THEN 1
                  ELSE 0
            end as a1_match,

...

      from guesses
      cross join answers

First, there is a fair bit of math on each row. Secondarily, its cross joining a couple large tables and creating a 30m row model. So in round numbers, there are 5 calculations for “guess” times 5 calculations for each “answer”, for 25 calculations per row. Multiply by 25m rows, you get 750m calculations.

Now since I have a pretty robust PC with 6 cores, why not run the dbt project on 6 threads? First things first – lets change our profile to run on 6 threads.

increase thread count to 6!

With that done, I had to partition my biggest table, crossjoin, into blocks that could be processed in parallel. I did this with the following code block:

{{ config(
 tags=["new","opt"],
 materialized="table"
 ) }}

-- Since I have 6 threads, I am creating 6 partitions

SELECT 1 as partition_key, 1 as "start", MAX(id) * 0.167 as "end"
FROM {{ ref( 'guesses_with_id' ) }}
UNION ALL
SELECT 2 as partition_key, MAX(id) * 0.167+1 as "start", MAX(id) * 0.333 as "end"
FROM {{ ref( 'guesses_with_id' ) }}
UNION ALL
SELECT 3 as partition_key, MAX(id) * 0.333+1 as "start", MAX(id) * 0.5 as "end"
FROM {{ ref( 'guesses_with_id' ) }}
UNION ALL 
SELECT 4 as partition_key, MAX(id) * 0.5+1 as "start", MAX(id) * 0.667 as "end"
FROM {{ ref( 'guesses_with_id' ) }}
UNION ALL 
SELECT 5 as partition_key, MAX(id) * 0.667+1 as "start", MAX(id) *0.833 as "end"
FROM {{ ref( 'guesses_with_id' ) }}
UNION ALL 
SELECT 6 as partition_key, MAX(id) * 0.833+1 as "start", MAX(id) as "end"
FROM {{ ref( 'guesses_with_id' ) }}
dag-tastic!

Then I split my table generation query into 6 parts. I believe this could probably be done with a macro in dbt? But I am not sure, so I did this by hand.

select
guesses.word as guess,
answers.word as answer,

...

from {{ ref( 'guesses_with_id' ) }} guesses
join {{ ref( 'guess_partition' ) }} guess_partition ON partition_key = 1 
      AND guesses.id BETWEEN guess_partition.start AND guess_partition.end
cross join  {{ ref( 'answers' ) }} answers

Then of course, I need a view that sits on top of the 6 blocks and combines them into a single pane for analysis. The resulting query chain looks like this.

I then executed my new code. You can see in htop how all 6 threads are active on Postgres while these queries execute.

why shouldn’t I use all 6 cores?

This results in a run time of 17.2s, a 53.8x improvement from the original query on my laptop and a 15.8x improvement on the initial query on the faster pc. Interestingly, going from 1 thread to 6 threads only gave us a 50% performance increase, so there were bottlenecks elsewhere (Bus? Ram? I am not an expert in these things).

17 seconds! pretty good

Real world applications

This optimization, taken as a whole, worked for a few reasons:

  • It’s trivial to add more compute to a problem, although there is real hard costs incurred.
  • The postgresql query planner was particularly inefficient in handling these CTEs – most likely calculating the same data multiple times. Materializing data as a table prevents these duplicative calculations.
  • Databases are great at running queries in parallel.

These exact optimization steps won’t work for every table, especially if the calculations are not discrete on a row-by-row basis. Since each calculation in core table “crossjoin” is row-based, partitioning it into pieces that can run in parallel is very effective.

Some constraints to consider when optimizing with parallelization:

  • Read/Write throughput maximums
  • Holding the relevant data in memory
  • Compute tx per second

This scenario is purely bottlenecked on compute – so optimizing for less compute in bulk (and then secondarily, more compute in parallel) did not hit local maximums for memory and read/write speeds. As noted above, running the threads in parallel did hit a bottleneck somewhere but I am not sure where.

If you want to try this for yourself, you can find the GitHub project here. It is built for Postgres + dbt-core 1.0.0, so can’t guarantee it works in other environments.

Hat tip to Derek for sparking my curiosity and putting his code out there so that I could use it.

PS – The best two-word combo I could come up using this code is: EARLS + TONIC.

Install dbt on Win10 – April 2021 (Updated October 2021)

I was getting a little frustrated with the web interface of dbt cloud, and just wanted to feel more “in control” of my environment. Which lead to this twitter thread:

Which led to lots of good recommendations (for Atom, POP SQL, vim+tmux, DeepChannel, and some others) but ultimately I settled on VS Code after a few auspicious DMs.

The process to install dbt on Win10 isn’t exactly friendly for an analyst using dbt without engineering experience, so I wanted to share my journey and hopefully make yours easier too.

Step 0: Install Python

Note: These steps have been tested with Python 3.9.6 and earlier. As of October 15th, 2021, Python 3.10.x is not working!

Before you do anything else, install python. Make sure to check the box to “Add Python to PATH”. If you don’t, you can only run it explicitly.

low quality screenshot – but peep the box at the bottom.

If you miss this step, you have two options:
1 – re-install of python and check the “Add Python to PATH” box.
2 – manually add the PATH for python; one example linked here.

To confirm it’s working, open the command line and enter ‘py –version’. It should return the version of python you installed. If you get an error, most likely it is a PATH issue.

Step 1: Install VS Code + MS Build tools

First, download links: VS Code, MS Build tools.

As a quick call out, you need MS Build tools for MSVC v140 or higher, which is an optional component of the C++ build tools.

more low quality images, but check the right box.

The MS Build tool install takes a bit of time, so I would do this when you have a little bit of time (15 minutes or so), and you will need to restart your computer.

Step 2: Config your VS Code Environment

There are a few items that need to be done to get VS Code ready for dbt, so I will list them here.

  • Open the command palette (Ctrl+Shift+P), and type ‘Python: Select Interpreter’. It should then bring up and allow you to select your python interpreter.
  • Add the dbt power user plug-in.
  • Take a look at this article for more suggested plug-ins from the dbt team.
  • Open your command line, and update pip to the latest version with the command ‘pip install –upgrade pip’.
    • If you get an error here, you may need to run it with the ‘–user’ flag as well, but I got that behavior inconsistently.

Step 3: Install dbt on Win10

If you’ve made it this far, congrats. If you are finding this article because your ‘pip install dbt’ isn’t working, go back to the top and work through the above steps first.

Note: As of October 15th, 2021, this is paragraph is no longer required, but I’ve kept it for historical reasons. In your command line, run the following: ‘pip install dbt –no-use-pep517 cryptography‘. The dependency chain is broken somewhere and this flag fixes it. Do I know why? No. Do I care? Also, no.

Go ahead and run ‘pip install dbt’ in your CLI. This takes a few minutes to run (5-10 mins), but when it’s done, you can check by running ‘dbt –version’ in the CLI. It should return the latest version (as of this update, 0.20.1).

Lastly, I recommend running ‘dbt init‘ to set up your initial ‘.dbt’ folder that holds your profiles.yml file to allow you to connect to your data warehouse. If you don’t run it, you will just need to create that file by hand later. Since that file contains your credentials, it is best practice to put that in another place outside of your source control.

And with that, you are ready to connect to your repo and begin working on your dbt project. And lastly – share your Ws on twitter!