The Many Wandering Paths to Analytics

If we treated careers more like dating, nobody would settle down so quickly.

David Epstein
Range: Why Generalists Triumph in a Specialized World

I consistently receive the same questions from people seeking an Analytics career: What classes should I take? What certifications should I get? Should I learn SQL, Python or R?

Behind those questions there’s a consistent assumption: “There must be a clear path to an analytics career.”

I’m here to challenge that assumption. There isn’t one clear path to work in analytics – most of us got there through a winding, wandering series of career moves. My story is one of many – ask someone in Analytics and you’ll hear something similar.

Typical Wandering Path

(1) Get a college degree or other training – not super relevant
(2) Work for a while in some job non-analytics related
(3) Recognize interest in analytics
(4) Start doing basic analytics at work (ideally) or on own time
(5) Leverage that experience into first analytics job

I’ll call out each step as it happened in my career journey.

My Wandering Path

Initial Career (Years 1-4)

Coming out of college I shared the assumption that careers were linear. After all, life to that point was linear, so why wouldn’t careers be the same?

Except, my linear plans fell apart two days before my wedding in 2011. I’d studied International Economic Development (Step 1), interned in Latin America, become fluent in Spanish and was planning to move with my soon-to-be wife to Bolivia. In one phone call and several subsequent conversations, that potential life and career ended. I was sitting in a dead-end job I thought I’d be leaving and had to figure out Plan B.

At first it wasn’t obvious – what else should I do? I was a Customer Success Manager (Step 2) but didn’t really want to do that as a career. I’d worked in sales departments, but didn’t really want to be a salesperson. But then I had an epiphany – there was a part of each of my first few roles that I loved that never was part of my job description.

I was consistently making little analytics & reports (Step 4 – which ironically for me came before step 3!). I’d turn 2,000 customer emails into a digestible summary for the product team. I’d make Salesforce forecasts & dashboards for the executive team. I made a Google Sheet for my Rosetta Stone team to help management track & manage renewal rates for their teams. This stuff was fun! I liked it! (Step 3) But what now?

The Great Filter: Landing First Data Job

Have you heard of the concept of the ‘Great Filter’? It’s part of the Fermi Paradox, which ponders why there is no extraterrestrial life given the seeming high probability it should exist in the universe. Within the Fermi Paradox, it’s the step getting from non-living matter to living matter (abiogenesis). The Great Filter is a catchall for “it’s hard to get past this point.”

I argue there is a Great Filter for those trying to get into Analytics – getting your first job. In fact, I’m devoting my next blog to this topic, so consider this a lead in to next week.

Passing Through the Great Filter

I realized I had an uphill climb ahead. Perhaps this is where many of you are – how do you get a company to take a chance on you?

I asked lots of current analysts via informational interviews at local coffee shops. They all said “I got here via a pretty random series of events.” Sound familiar?

They gave me a breadcrumb trail, though: “You have to get enough experience together and communicate about it well enough to get someone to try you out.” Easier said than done, but I did have some experience already in my current role.

I applied everywhere. I was told by recruiters/HR multiple times “Hey, I guess you could be an analyst but I think your future is in sales based on your resume.” Leads fizzled out until one day I got a call.

The Meeting That Changed Everything

“Can you show up at the office in an hour? The CFO and SVP of Sales want to talk.”

I got that call from Jacob — the same Jacob here at DataDuel. He was working at Funko, a quickly-growing collectibles company north of Seattle. They didn’t have a position open yet, but there was interest in getting analytics going. Before going into the meeting, here is all I knew:

  • The position was planned to be part-time Analytics and part-time something else until analytics skills were proven
  • The position was planned to be a contract position (not great – I’d just bought my first house and wasn’t looking for a contract spot)
  • There would be minimal support since no data team existed, so a self-starter attitude was needed

Given those three bullet points, I had three goals going in:

  • Communicate my potential to be great at analytics if given the chance
  • Sell them that I was worth a shot as an employee & not a contractor
  • Demonstrate self-starter attitude to analytics from previous roles

I quickly threw on a dress shirt, re-learned how to tie a tie on YouTube and flew out to the car.

The rest is history and full of the content I’ll fill this weekly series with. The conversation went really well and they decided to take a chance on me a couple weeks later (Step 5!). While I got a full-time position, I took a 10% pay cut because I needed to prove myself. I knew the temporary sacrifice would be worth it – I just needed my first position to get past the Great Filter.

In Conclusion

There’s no one path to analytics – there are many. I’ve used my path as an anecdote for the infinite options out there.

The general path, though, is to start doing some analytics in any fashion you can, and leverage that experience to get your first position. It isn’t easy – there’s a Great Filter out there which prevents many from getting in.

Come back next week and I’ll dive into the Analytics Great Filter in more detail, and provide some practical options of how to overcome it.

Passing dynamic parameters to SQL Server stored procedures with dbt

If you are using SQL Server with dbt, odds are that you probably have some stored procedures lurking in your database. And of course, the sql job agent is probably running some of those on a cron. I want to show another way to approach these, using dbt run-operations and GitHub actions. This will allow you to have a path towards moving your codebase into a VCS like git.

Unwrapping your wrapper with jinja

The pattern I am most familiar with is using the sql agent to run a “wrapper”, which servers to initialize the set of variables to pass into your stored procedure. The way I have done this with dbt is a bit different, and split into two steps: 1) writing the variables into a dbt model and 2) passing that query into a table that dbt can iterate on.

Since your model to stuff the variables into a table (step 1) is highly contextual, I’m not going to provide an example, but I will show how to pass an arbitrary sql query into a table. Example below:

{% set sql_statement %}
    SELECT * FROM {{ ref( 'my_model' ) }}
{% endset %}

{% do log(sql_statement, info=True) %}

{%- set table = run_query(sql_statement) -%}

For those of you from the SQL Server world – the metaphor here is a temporary table. You can find more about run_query here.

Agate & for loops

What we have created with the run_query macro is an Agate table. This means we can perform any of the Agate operations on this data set, which is pretty neat! In our case, we are going to use a python for loop and pass in the rows of our table.

{% for i in table.rows  -%}
    {% set stored_procs %}
        EXECUTE dbo.your_procedure
            @parameter_1 = {{ i[0] }}
            , @parameter_2 = {{ i[1] }}
    {% endset %}
    {%- do log("running query below...", info=True)  -%}
    {% do log(stored_procs, info=True) %}
    {% do run_query(stored_procs) %}
    {% set stored_procs = true %}
{% endfor %} 

The clever thing to do here with python is that we can pass multiple columns into our stored procedure, which differs from something like dbt_utils.get_column_values that can also be used as part of a for loop, but only for a single column. In this case we can reference which column to return from our table with variable[n], so i[0] returns the value in the first column in the current row, i[1] returns the second column and so on.

Building the entire macro

Now that we have the guts of this worked out, we can pull it together in an entire macro. I’m adding ‘dry_run’ flag so we can see what the generate SQL is for debugging purposes, without having to execute our procedure. As a side note, you could also build this as a macro that you run as pre or post hook, but in that case you would need to include an ‘if execute‘ block to make sure you don’t run the proc when project is compiled and so on.

-- Execute with: dbt run-operation my_macro --args '{"dry_run": True}'
-- to run the job, run w/o the args

{% macro my_macro(dry_run='false') %}
{% set sql_statement %}
    SELECT * FROM {{ ref( 'my_model' ) }}
{% endset %}

{% do log(sql_statement, info=True) %}

{%- set table = run_query(sql_statement) -%}

{% for i in table.rows  -%}
    {% set stored_procs %}
        EXECUTE dbo.your_procedure
            @parameter_1 = {{ i[0] }}
            , @parameter_2 = {{ i[1] }}
    {% endset %}
    {%- do log("running query below...", info=True)  -%}
    {% do log(stored_procs, info=True) %}
    {% if dry_run == 'false' %}
        {% do run_query(stored_procs) %}
    {% endif %}
    {% set stored_procs = true %}
{% endfor %}  
{% do log("my_macro completed.", info=True) %}
{% endmacro %}

Running in a Github action

Now that we have the macro, we can execute in dbt with ‘dbt run-operation my_macro’. Of course, this is great when testing but so no great if you want this in production. There are lots of ways you run this: on-run-start, on-run-end, as a pre or post-hook. I am not going to do that in this example, but instead share how you can run this a stand alone operation in github actions. I’ll start with the sample code.

name: run_my_proc

on:
  workflow_dispatch:
    # Inputs the workflow accepts.
    inputs:
      name:
        # Friendly description to be shown in the UI instead of 'name'
        description: 'What is the reason to trigger this manually?'
        # Default value if no value is explicitly provided
        default: 'manual run for my stored procedure'
        # Input has to be provided for the workflow to run
        required: true

env:
  DBT_PROFILES_DIR: ./
  MSSQL_USER: ${{ secrets.MSSQL_USER }}
  MSSQL_PROD: ${{ secrets.MSSQL_PROD }}
  MSSQL_LOGIN: ${{ secrets.MSSQL_LOGIN }}
   
jobs:
  run_my_proc:
    name: run_my_proc
    runs-on: self-hosted

    steps:
      - name: Check out
        uses: actions/checkout@master
      
      - name: Get dependencies # ok guess I need this anyway
        run: dbt deps --target prod

      - name: Run dbt run-operation
        run: dbt run-operation my_macro

As you can see – we are using ‘workflow_dispatch’ as our hook for the job. You can find out more about this in the github actions documentation. So now what we have in github is the ability to run this macro on demand with a button press. Neat!

Closing thoughts

One of the challenges I have experienced with existing analytics projects on SQL Server and dbt is “what do I do about my stored procedures”. They can be very hard to fit into the dbt model in my experience. So this is my attempt at a happy medium where you can continue to use those battle tested stored procedures while continuing build out and migrate towards dbt. Github actions is a simple, nicely documented way to start moving logic away from the sql job agent, and you can run it “on-prem” if you have that requirement. Of course, you can always find me on twitter @matsonj if you have questions or comments!

Three steps to handling sharded databases with dbt

A common pattern in scaling production app databases is to keep them as small as possible. Since building production apps is not my forte, I’ll lean on the commentary of experts. I like how Silvia Botros, author of High Performance MySQL, frames it below:

just keep sharding, just keep sharding…

This architecture presents a unique challenge for analytics engineering because you now have many databases with identical schemas, and dbt sources must be enumerated in your YAML files.

I am going to share the three steps that I use to solve this problem. It should be noted that if you are comfortable with jinja, I am sure there are better, more pythonic ways to solves this problem. I have landed on this solution as something that is easy to understand, fast to develop, and fast to run (i.e. performant).

Step 1: leverage YAML anchors and aliases

Anchors and Aliases are YAML constructions that allow you to reduce repeat syntax and extend existing data nodes. You can place Anchors (&) on an entity to mark a multi-line section. You can then use an Alias (*) call that anchor later in the document to reference that section.

https://www.educative.io/blog/advanced-yaml-syntax-cheatsheet

By using anchors and aliases, we can drastically cut down on the amount of duplicate code that we need to write in our YAML file. A simplified version of what I have is below.

  - name: BASE_DATABASE
    database: CUSTOMER_N
    schema: DATA
    tables: &SHARD_DATA
      - name: table_one
        identifier: name_that_makes_sense_to_eng_but_not_data
        description: a concise description
      - name: table_two

  - name: CUSTOMER_DATABASE
    database: CUSTOMER_N+1
    schema: DATA
    tables: *SHARD_DATA

Unfortunately with this solution, every time a new shard is added, we have to add a new line to our YAML file. While I don’t have a solution off hand, I am certain that you could generate this file with Python.

Step 2: Persist a list of your sharded databases

This next steps seems pretty obvious, but you need a list of your shards. There are multiple ways to get this data, but I will share two of them. The first is getting the list directly from your information schema.

(SQL SERVER)
SELECT * FROM sys.databases;

(SNOWFLAKE)
SELECT * FROM information_schema.databases

You can then persist that information in a dbt model that you can query later.

The second way is to create a dbt seed. Since I already have a manual intervention in step 1, I am ok with a little bit of extra work in managing a seed as well. This also gives me the benefit of source control so I can tell when additional shards came online. And of course, this gives a little finer control over what goes into your analytics area since you may have databases that you don’t want to include in the next step. An example seed is below.

Id,SourceName
1,BASE_DATABASE
2,CUSTOMER_DATABASE

Step 3: Use jinja + dbt_utils.get_column_values to procedurally generate your SQL

The of magic enabled by dbt here is that you can put a for loop inside your SQL query. This means that instead of writing out hundreds or thousands of lines of code to load your data into one place, dbt will instead generate it. Make sure that you have dbt_utils in your packages.yml file and that you have run ‘dbt deps’ to install it first.

{% set source_names = dbt_utils.get_column_values(table=ref('seed'), column='SourceName') %}
{% for sn in source_names %}
  SELECT field_list,
    '{{ sn }}' AS source_name
  FROM {{ source( sn , 'table_one' ) }} one
    INNER JOIN {{ ref( 'table_two' ) }} two ON one.id = two.id
  {% if not loop.last %} UNION ALL {% endif %}
{% endfor %}

In the case of our example, since we have two records in our ‘seed’ table, this will create two SQL queries with a UNION between them. Perfect!

Now I have scaled this to 25 databases or so, so managing it by hand works fine for me. Obviously if you have thousands of databases in production in this paradigm, running a giant UNION ALL may not be feasible (also I doubt you are reading this article if you have that many databases in prod). In fact, I ran into some internal constraints with parallelization with UNION with some models, so I use pre and post-hooks to handle it in a more scalable manner for those. Again, context matters here, so depending on the shape of your data, this may not work for you. Annoyingly, this doesn’t populate the dbt docs with anything particularly meaningful so you will need to keep that in mind.

(SQL SERVER)

{{ config(
    materialized = "table",
    pre_hook="
      DROP TABLE IF EXISTS #source;
      CREATE TABLE #source
      (
        some_field INT
      );

      {% set source_names = dbt_utils.get_column_values(table=ref('seed'), column='SourceName') %}
      {% for sn in source_names %}
        SELECT field_list,
          '{{ sn }}' AS source_name
        FROM {{ source( sn , 'table_one' ) }} one
          INNER JOIN {{ ref( 'table_two' ) }} two ON one.id = two.id
       {% endfor %} 
       DROP TABLE IF EXISTS target;
       SELECT * INTO target FROM #source",
    post_hook="
      DROP TABLE #source;
      DROP TABLE target;"
  )
}}    

SELECT * FROM target

So there you have it, a few ways to pull multiple tables into one with dbt. Hope you found this helpful!

Alternative methods: using dbt_utils.union_relations

In theory, using dbt_utils.union_relations can also accomplish the same as step 3, but I have not tested it that way.

Revisiting data query speed with DuckDB

I’m always really curious to learn more about optimization, especially as it relates to querying data. This lead me down the journey of watching this series of lectures by the CMU database group, which really opened my mind to how to get better performance out of my data pipelines.

One of the biggest realizations for me was in a slide in the CMU lectures that indicated >90% of compute usage in OLTP databases is NOT related to transactions (things like concurrency management & memory management). The insight for me was that by stripping away those requirements, I could get much faster performance. Initially, I probed SQL Server’s In-Memory OLTP functionality (aka Hekaton), but the feedback from people in my network was either “haven’t used it” or “it was a horrible experience, don’t waste your time.”

Around the same time, I was hearing a lot of chatter related to DuckDB. Install and setup was so simple, that I figured I would download it and mess around a little bit. Since I recently had done some optimization of queries related to wordle where I was able to improve query performance 53.8x, I figured it would be good to revisit it. To say I was blown away would be an understatement.

First, the process to install DuckDB is very simple. Assuming you already have some python knowledge, it’s a single-line install with pip. Adding the dbt connector was also very simple. In fact, setting up your dbt profile is as simple as:

duckdb:
  target: dev
  outputs:
    dev: 
      type: duckdb

But I digress, I actually didn’t need to even get into dbt to run this experiment. Just like my previous post, I am doing the testing with this query, which looks at two lists of words for the game “wordle” and then finds the top 500 words with the most matches (for those curious, the top matching words are: orate / roate / oater). It’s not particularly fast on postgresql, clocking around 487s (8m7s) when I run it on my laptop (postgresql running under WSL2). In the previous post, I was able to get it to run in around 17.2s by using some intermediate materializations and partitioning the compute-intensive part of the query to run in parallel (and also using a faster CPU).

With DuckDB, we are doing a little surgery on the query to pull the source data directly out of CSVs. Instead of ‘FROM table’ like in postgresql (where we first load the data to a table and then analyze it next), I am using read_csv_auto in DuckDB to pull the data straight off my harddrive.

FROM read_csv_auto('C:\Users\matso\code\wordle\data\wordle.csv',header=True)

I modified the FROM clause in both of my CTEs, and then ran the query. The results honestly astonished me.

6 seconds in DuckDB vs 487s in Postgresql.

Surely this couldn’t be right! First off – the data wasn’t even LOADED into the database since I was selecting it right off of my disk. I ran it again, 6 seconds.

An 80x increase in performance.

Honestly, I don’t think there is much left to write about here, but I have definitely been contemplating how much time I’ve spent getting pretty skilled at OLTP query optimization only to see DuckDB just do it faster. Obviously, this is not a benchmark, so performance in the real world may vary tremendously, but this is certainly enough for me to really figure out how to get this to play nicely within my analytics stack.

If you want to find the data to reproduce this yourself, you can find the source data here and the base query here.

Footnote: I replicated the same data into SQL Server 2019 and added COLUMNSTORE indexes. Query time for the base query was approx 1m30s. So 3-4x faster than postgresql (unoptimized/tuned), but still much slower than DuckDB.

Connect Snowflake to Excel in Minutes

Data “Self-Serve” is a buzzword that’s managed to stick around for a long time without a solution. However, I’m convinced that we can get partway there with simple data products rooted in familiar tools. One ubiquitious tool? Excel. Nearly everyone uses spreadsheets or similar productivity tools at work.

That leads me to meet stakeholders where they’re at: in Excel. And modern data warehouses like Snowflake make it really easy to do so. It’s an easy win if you’ve invested in Analytics Engineering to create clean datasets in your database. Let’s bring those datasets to your users.

Here’s how to connect Snowflake into Excel and enable live connections pivot tables in minutes. These are instructions for Windows specifically.

Step by Step Instructions

(1) Install the ODBC Driver

Click on the “Help” button in the Snowflake UI, go to “Download…” and select “ODBC Driver” and “Snowflake Repository”. Install from the file that downloads.

(2) Configure ODBC Driver

Go to your start menu and type in “ODBC” and click on ODBC Data Sources (64 bit)

Under User DSN, select Add…

Select SnowflakeDSIIDriver from the menu

Fill in the boxes as follows – though your individual situation may vary. My example uses SSO when an organization doesn’t allow direct usernames/passwords for Snowflake. Lots of options here and Snowflake has full documentation of options here.

Note: I found that lots of databases & schemas are available even after choosing some here. Not sure the full limitations, so you can play with options. I put all options in for the primary database I cared about and it worked fine.

Click on Test… to confirm it worked. Here’s the dialog if it did:

(3) Connect to database in Excel

Open Excel and go to the Data tab, click on Get Data and choose From Other Sources and pick From ODBC

From the window that pops up, pick the Snowflake connection and select OK

If successful, you’ll see a window with a dropdown showing your available databases. Use that dropdown to pick the database you want.

IMPORTANT: There is an easy way to load data directly into a Pivot Table at this point (thanks Jacob for this tip!) which will save you and teams time.

Once you select the database / schema / table you want, go to that “Load” button on the bottom and click the little down arrow next to it. Choose “Load to…”

The next menu that pops up will give you various options – pick the second one down saying PivotTable Report

DONE. You’re there. The data is now connected live to Snowflake and is available to pivot. I used Snowflake’s sample “Weather” table which I just learned has basically nothing in it, but that’s besides the point.

Parting notes

There are a couple interesting tidbits to pass both to your stakeholders as well as anyone concerned about Snowflake compute cost & data security.

(1) Stakeholders can refresh data live from Snowflake any time. By right-clicking the pivot table and selecting “Refresh”. No more stakeholders asking you for the latest data – they can just get it anytime.

(2) Data is cached on the local machine, reducing compute costs & keeping things snappy for stakeholders. This satisfies worries from both stakeholders on performance (it’s REALLY snappy, even for huge tables) as well as those concerned on cost (compute only happens on refresh).

That’s it! Just a few installations and clicks and you’ve connected Snowflake live into Excel for any stakeholder. Happy self-serving.

Optimizing SQL queries for speed with dbt

Like most people, I’ve been obsessed with Wordle for the past few weeks. It’s been a fun diversion and the perfect thing to do while sipping a cup of coffee.

But of course, my brain is somewhat broken by SQL and when I saw this GitHub repo courtesy of Derek Visch, I was intrigued by the idea of using SQL to build a Wordle optimizer.

Using his existing queries, I was able to get a list of “optimal” first words. But it took forever! On my laptop, over 900 seconds. Surely this thing could be optimized.

the first dbt run of the query

For reference, you can find the query here, but I’ve pulled a point in time copy below.

{{ config( tags=["old"] ) }}

WITH guesses as (
      SELECT 
            word,
            SUBSTRING(word, 1, 1) letter_one,
            SUBSTRING(word, 2, 1) letter_two,
            SUBSTRING(word, 3, 1) letter_three,
            SUBSTRING(word, 4, 1) letter_four,
            SUBSTRING(word, 5, 1) letter_five
      FROM {{ ref( 'wordle' ) }} ), 
answers as (
      select
            word,
            SUBSTRING(word, 1, 1) letter_one,
            SUBSTRING(word, 2, 1) letter_two,
            SUBSTRING(word, 3, 1) letter_three,
            SUBSTRING(word, 4, 1) letter_four,
            SUBSTRING(word, 5, 1) letter_five
      from {{ ref( 'answer' ) }} ), 
crossjoin as (
      select
            guesses.word as guess,
            answers.word as answer,
            CASE 
                  WHEN answers.letter_one in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five)   THEN 1
                  ELSE 0
            end as a1_match,
            CASE 
                  WHEN answers.letter_two in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five)   THEN 1
                  ELSE 0
            end as a2_match,
            CASE 
                  WHEN answers.letter_three in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five)   THEN 1
                  ELSE 0
            end as a3_match,
            CASE 
                  WHEN answers.letter_four in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five)   THEN 1
                  ELSE 0
            end as a4_match,
            CASE 
                  WHEN answers.letter_five in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five)   THEN 1
                  ELSE 0
            end as a5_match
      from guesses
      cross join answers), 
count_answers as (
      select 
            guess,
            answer,
            a1_match + a2_match + a3_match + a4_match + a5_match as total
      from crossjoin), 
maths_agg as (
      select
            guess,
            sum(total),
            avg(total) avg,
            stddev(total),
            max(total),
            min(total)
      from count_answers
      group by guess
      order by avg desc ), 
final as (
      select * 
      from maths_agg )

select * 
from final

The first optimization

The first, most obvious lever to pull on was to increase compute! So I switched to my newly built gaming PC. The environment setup is win 11 pro , dbt 1.0.0, and postgres 14 (via WSL2), running on an AMD 5600G processor with 32GB of RAM, although WSL2 only has access to 8GB of RAM. I will detail the environment setup in another post.

With this increased compute, I was able to reduce run time by 3.4x, from 927s to 272s.

getting faster

The second optimization

The next level was inspecting the query itself and understand where potential bottlenecks could be. There are a couple ways to do this, one of which is using the query planner. In this case, I didn’t do that because I don’t know how to use the postgresql query planner – mostly I’ve used SQL Server so I’m a bit out of my element here.

So I took each CTE apart and made them into views & tables depending complexity. Simple queries that are light on math can be materialized as views, where as more complex, math intensive queries can be materialized as tables. I leveraged the dbt config block in the specific queries I wanted to materialize as tables.

one query, now multiple models + 1 DAG

Simply by strategically using the table materialization, we can increase performance by 9.0x – 272s to 30s.

much better

The third optimization

Visually inspecting the query further, the crossjoin model is particularly nasty as a CTE.

crossjoin as (
      select
            guesses.word as guess,
            answers.word as answer,
            CASE 
                  WHEN answers.letter_one in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five)   THEN 1
                  ELSE 0
            end as a1_match,

...

      from guesses
      cross join answers

First, there is a fair bit of math on each row. Secondarily, its cross joining a couple large tables and creating a 30m row model. So in round numbers, there are 5 calculations for “guess” times 5 calculations for each “answer”, for 25 calculations per row. Multiply by 25m rows, you get 750m calculations.

Now since I have a pretty robust PC with 6 cores, why not run the dbt project on 6 threads? First things first – lets change our profile to run on 6 threads.

increase thread count to 6!

With that done, I had to partition my biggest table, crossjoin, into blocks that could be processed in parallel. I did this with the following code block:

{{ config(
 tags=["new","opt"],
 materialized="table"
 ) }}

-- Since I have 6 threads, I am creating 6 partitions

SELECT 1 as partition_key, 1 as "start", MAX(id) * 0.167 as "end"
FROM {{ ref( 'guesses_with_id' ) }}
UNION ALL
SELECT 2 as partition_key, MAX(id) * 0.167+1 as "start", MAX(id) * 0.333 as "end"
FROM {{ ref( 'guesses_with_id' ) }}
UNION ALL
SELECT 3 as partition_key, MAX(id) * 0.333+1 as "start", MAX(id) * 0.5 as "end"
FROM {{ ref( 'guesses_with_id' ) }}
UNION ALL 
SELECT 4 as partition_key, MAX(id) * 0.5+1 as "start", MAX(id) * 0.667 as "end"
FROM {{ ref( 'guesses_with_id' ) }}
UNION ALL 
SELECT 5 as partition_key, MAX(id) * 0.667+1 as "start", MAX(id) *0.833 as "end"
FROM {{ ref( 'guesses_with_id' ) }}
UNION ALL 
SELECT 6 as partition_key, MAX(id) * 0.833+1 as "start", MAX(id) as "end"
FROM {{ ref( 'guesses_with_id' ) }}
dag-tastic!

Then I split my table generation query into 6 parts. I believe this could probably be done with a macro in dbt? But I am not sure, so I did this by hand.

select
guesses.word as guess,
answers.word as answer,

...

from {{ ref( 'guesses_with_id' ) }} guesses
join {{ ref( 'guess_partition' ) }} guess_partition ON partition_key = 1 
      AND guesses.id BETWEEN guess_partition.start AND guess_partition.end
cross join  {{ ref( 'answers' ) }} answers

Then of course, I need a view that sits on top of the 6 blocks and combines them into a single pane for analysis. The resulting query chain looks like this.

I then executed my new code. You can see in htop how all 6 threads are active on Postgres while these queries execute.

why shouldn’t I use all 6 cores?

This results in a run time of 17.2s, a 53.8x improvement from the original query on my laptop and a 15.8x improvement on the initial query on the faster pc. Interestingly, going from 1 thread to 6 threads only gave us a 50% performance increase, so there were bottlenecks elsewhere (Bus? Ram? I am not an expert in these things).

17 seconds! pretty good

Real world applications

This optimization, taken as a whole, worked for a few reasons:

  • It’s trivial to add more compute to a problem, although there is real hard costs incurred.
  • The postgresql query planner was particularly inefficient in handling these CTEs – most likely calculating the same data multiple times. Materializing data as a table prevents these duplicative calculations.
  • Databases are great at running queries in parallel.

These exact optimization steps won’t work for every table, especially if the calculations are not discrete on a row-by-row basis. Since each calculation in core table “crossjoin” is row-based, partitioning it into pieces that can run in parallel is very effective.

Some constraints to consider when optimizing with parallelization:

  • Read/Write throughput maximums
  • Holding the relevant data in memory
  • Compute tx per second

This scenario is purely bottlenecked on compute – so optimizing for less compute in bulk (and then secondarily, more compute in parallel) did not hit local maximums for memory and read/write speeds. As noted above, running the threads in parallel did hit a bottleneck somewhere but I am not sure where.

If you want to try this for yourself, you can find the GitHub project here. It is built for Postgres + dbt-core 1.0.0, so can’t guarantee it works in other environments.

Hat tip to Derek for sparking my curiosity and putting his code out there so that I could use it.

PS – The best two-word combo I could come up using this code is: EARLS + TONIC.

Running a personal SQL server for free

For some, getting into data analytics outside of an academic or work environment can be very challenging – where do you start? Which database do you use? And how do you do it for low or zero cost?

In this article, I am going to walk through setting up your VM1 & database, connecting to your new remote server using Azure Data Studio, and as a bonus, connecting it to dbt. I’ve also written about setting up dbt on windows on a previous post.

First, let’s talk about requirements & recommendations:

  1. This tutorial is focused on Windows 10 + Linux. You will need Windows 10 Pro where you install your VM.
  2. I recommend that you set up your database on different physical machine than your dev machine. You should probably have at least 32GB of RAM.
  3. Since we are installing the database on another machine, that machine needs to be on the same network as your development machine.

Why use a VM at all? In my experience, running a database on your dev machine makes everything extremely slow. Your database will be very greedy with resources (RAM specifically) – so keeping it in a little box that you can turn on and off allows you to keep using your machine “as normal”.

Step 1: Enable HyperV

Open powershell as administrator and run the following command:

Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V -All

More info can be found here: https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/quick-start/enable-hyper-v

Step 2: Create a VM in HyperV

You will need to restart your machine in order to use the HyperV features, so machine sure to do that first. The Microsoft documents to create a VM are exellent – and linked below. Make sure to select Ubuntu 20.04 when you create it.

https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/quick-start/quick-create-virtual-machine

Step 3: Install SQL Server on your VM

We will do the install of SQL Server2 in the CLI on Ubuntu, which MS has laid out again very nicely in their documentation. A couple of notes when walking through this:

  1. Make sure to select “SQL Server Express” as your edition. It limits your database size to 9GB but is otherwise relatively unencumbered by MS licensing.
  2. Write down your SA password. You will need it later when connecting.

This is quite detailed, so head over to this link and follow the instructions in detail: https://docs.microsoft.com/en-us/sql/linux/sql-server-linux-setup?view=sql-server-ver15

Step 4: Update the settings of your virtual switch

The default settings inside HyperV is for an “internal network” on your VM. This is fine if you are accessing your VM from the machine its running on, but the whole point here is that you want it to be a “remote server”. Set the virtual switch to “external network” and you can then access your VM from any machine on your network.

Again, MS has great documentation on this here: https://docs.microsoft.com/en-us/windows-server/virtualization/hyper-v/get-started/create-a-virtual-switch-for-hyper-v-virtual-machines

Step 5: Install Azure Data Studio on your dev machine – and write some SQL!

On your dev machine, make sure you can ping your VM. In my case, my VM is named “jacob-virtual-machine”, so the command to validate I can reach it is:

ping jacob-virtual-machine

If you can’t ping your VM, you have some networking issues to sort out. While I am no expert here, you will want to make sure you can see your VM outside the host (Step 4, above) and that port 1433 is open on the host and the VM.

Once that is resolved, you can download and install Azure Data Studio3. Now, with the credentials from above and you VM name, you can connect to your remote server. Everything can be left on defaults, but the avoidance of doubt, check out my connection settings below.

SQL Server Connection Settings

Now you have it all working and you have your own nice empty database to play with!

Bonus Content: Connect dbt to SQL Server

For those of you wishing to use dbt with SQL Server, check out the dbt-sqlserver github. It has great details, but I’ll summarize the key bits.

You will need to install the dbt connector:

pip install dbt-sqlserver

I also find their explanation of the profiles.yml file kind of confusing, so I’ve included my own below for reference:

local_sql:
  target: dev
  outputs:
    dev: 
      type: sqlserver
      driver: 'ODBC Driver 17 for SQL Server'
      server: <VM name>
      database: <database name>
      port: 1433
      schema: <schema name>
      user: <username>
      password: <password>

Footnotes

1 You can also probably do this with WSL2, and not install a Linux VM. However, I am going to be running more software on the VM later and I want to split it to another machine. You can also use docker over top of all of this, which I may cover in another post.
2 I’m choosing SQL Server for a couple reasons: I am familiar with it and the documentation and community are large. PostgreSQL also works here, which has the advantage of having a default dbt connector.
3 SSMS works here too, but Azure Data Studio has the advantage of being cross platform. If you are using dbt, you need a SQL runner anyway as the VS code options aren’t great.

Write Code Last – 4 steps to better dashboards

I gave a talk last week about “Data to Dashboard” and I wanted to share it here, too. There is a lot of discussion in the analytics space about dashboards and how to make them look good but less about how to get to that point. This is my take on the subject – I hope you enjoy it.

Toronto Data Workshop – 6/18/2021

Install dbt on Win10 – April 2021 (Updated October 2021)

I was getting a little frustrated with the web interface of dbt cloud, and just wanted to feel more “in control” of my environment. Which lead to this twitter thread:

Which led to lots of good recommendations (for Atom, POP SQL, vim+tmux, DeepChannel, and some others) but ultimately I settled on VS Code after a few auspicious DMs.

The process to install dbt on Win10 isn’t exactly friendly for an analyst using dbt without engineering experience, so I wanted to share my journey and hopefully make yours easier too.

Step 0: Install Python

Note: These steps have been tested with Python 3.9.6 and earlier. As of October 15th, 2021, Python 3.10.x is not working!

Before you do anything else, install python. Make sure to check the box to “Add Python to PATH”. If you don’t, you can only run it explicitly.

low quality screenshot – but peep the box at the bottom.

If you miss this step, you have two options:
1 – re-install of python and check the “Add Python to PATH” box.
2 – manually add the PATH for python; one example linked here.

To confirm it’s working, open the command line and enter ‘py –version’. It should return the version of python you installed. If you get an error, most likely it is a PATH issue.

Step 1: Install VS Code + MS Build tools

First, download links: VS Code, MS Build tools.

As a quick call out, you need MS Build tools for MSVC v140 or higher, which is an optional component of the C++ build tools.

more low quality images, but check the right box.

The MS Build tool install takes a bit of time, so I would do this when you have a little bit of time (15 minutes or so), and you will need to restart your computer.

Step 2: Config your VS Code Environment

There are a few items that need to be done to get VS Code ready for dbt, so I will list them here.

  • Open the command palette (Ctrl+Shift+P), and type ‘Python: Select Interpreter’. It should then bring up and allow you to select your python interpreter.
  • Add the dbt power user plug-in.
  • Take a look at this article for more suggested plug-ins from the dbt team.
  • Open your command line, and update pip to the latest version with the command ‘pip install –upgrade pip’.
    • If you get an error here, you may need to run it with the ‘–user’ flag as well, but I got that behavior inconsistently.

Step 3: Install dbt on Win10

If you’ve made it this far, congrats. If you are finding this article because your ‘pip install dbt’ isn’t working, go back to the top and work through the above steps first.

Note: As of October 15th, 2021, this is paragraph is no longer required, but I’ve kept it for historical reasons. In your command line, run the following: ‘pip install dbt –no-use-pep517 cryptography‘. The dependency chain is broken somewhere and this flag fixes it. Do I know why? No. Do I care? Also, no.

Go ahead and run ‘pip install dbt’ in your CLI. This takes a few minutes to run (5-10 mins), but when it’s done, you can check by running ‘dbt –version’ in the CLI. It should return the latest version (as of this update, 0.20.1).

Lastly, I recommend running ‘dbt init‘ to set up your initial ‘.dbt’ folder that holds your profiles.yml file to allow you to connect to your data warehouse. If you don’t run it, you will just need to create that file by hand later. Since that file contains your credentials, it is best practice to put that in another place outside of your source control.

And with that, you are ready to connect to your repo and begin working on your dbt project. And lastly – share your Ws on twitter!

Start Simple With Your Analytics Project

Start Simple & Iterate

Up to this point, I’ve largely written for those looking to break into an analytics career. Today I’ll go beyond that and discuss the most powerful lesson I and many others learned — something I wish I fully understood starting out:

Start your analytics project as simple as possible and iterate from there.

This strategy borrows a lot from Agile software development not because I’m a student of it, but because I learned the values of Agile through trial and error. Only after I stumbled upon this strategy did I learn how closely it aligns to the Agile methodology.

The Common Mistake

I’m going to assume you’ve already solved the toughest issue in analytics: identifying an ambiguous problem. Congrats! Now you need to figure out how to make it happen. This is where things can go wrong.

Many analysts (myself included!) are then tempted to:

  • Retreat to your office
  • Gather & clean all the data you think everyone needs
  • Build the World’s Best V1 Dashboard
  • Schedule a meeting to present the dashboard
  • Receive unanimous praise for how amazing it is
  • Watch as everyone uses your dashboard daily

What really happens:

  • Retreat to your office
  • Gather & clean only some of the data people need
  • Spend way too long building the Dashboard No One Really Wanted
  • Stakeholders email you intermittently asking if you’re making progress
  • Schedule a meeting to present the dashboard
  • Entire meeting spent fielding questions like “Why don’t I see X or Y?”
  • Get the cold sweats realizing you don’t have what they need
  • Stakeholders frustrated that so much dev time was wasted
  • You’re frustrated that they are “changing what they need”
  • Retreat to your office

Why Does This Happen?

Every data analyst/scientist makes this mistake. It will continually happen throughout your career, even after you think you’ll never make that mistake again. No one is immune.

There is one core reason why this happens: You assume you understand what the stakeholder wants.

Except you likely don’t. Especially when you’re early in your career. You’ll think you’re on the same page with your stakeholder, but you aren’t. You think you know what data points the stakeholder needs, but you don’t (hint: the stakeholder likely doesn’t know either!). You think you know what kind of visuals the stakeholder will find most useful, but you don’t.

In fact, it’s so difficult to get everything right the first time, you should assume you don’t fully understand the request. That one time you actually do build “The World’s Best V1 Dashboard”, celebrate the unexpected success – it won’t happen often.

Strategy: Start Simple

There’s a solution to this problem: Start your analytics projects as simple as possible. This results in less wasted time in development and happier stakeholders at the end. The process looks like this:

  • Agree with stakeholder on an MVP (Minimum Viable Product) – something small that can be done quickly
    • Your stakeholder may not know exactly what they want, so you may have lots of freedom here
  • Gather & clean only the data you need for the MVP
  • Create MVP dashboard
    • Ask your stakeholder questions here, too! You don’t need to go radio silent and many times they’ll appreciate the feedback loop
  • Present MVP dashboard to stakeholder
  • Gather feedback from stakeholder
  • Start process over again

This process is designed to be quick, with small iterations should building on each other until everyone agrees the dashboard fits the needs of the business. The more interactions with stakeholders the better – you’ll quickly identify misalignments, missing data, new requirements, changing business needs and more.

The advantages should be clear. Stakeholders will feel ownership over a product they helped develop (leading to better adoption!). The end product will be closer to what the business needs (leading to better adoption!). And stakeholders will remember the success of the project and give you a call for the next one.

Conclusion

Don’t try to build Rome in a day on any analytics project. You’ll rarely succeed. Instead, iterate and build on a project until it becomes something useful – and likely looks nothing like what you thought it would starting out.

Analytics is a dynamic field. Don’t fight upstream with how quickly things change; set up your work process to allow for quick changes. Your company & future self will thank you.

Medium Data: MS edition

This video is for your data that is too big for an excel spreadsheet and too small for a data warehouse. I like to refer to this as “Medium Data”.

I can think of many times I needed this during my career. Typically, the “medium data” scenarios were related to snapshotting historical data weekly and showing changes in trends over time. One good trick I learned in one of my first jobs was to snapshot my CRM order book every week and save it in a CSV format. Eventually, that got too large for my meager tools, and I started aggregating, losing data, or other hacks (i.e., multiple excel files). Linking excel files together was basically enough to motivate me to learn SQL. With Azure, you can easily scale into the next size of data and keep your analytics rolling. Check the video below for a 15 min walk through.

Going from CSV to SQL in 16 minutes

I’ve just shown the basics – but there are some awesome articles out there that can go more in-depth, including some great automation.

The core tutorial in this video can be found here: https://social.technet.microsoft.com/wiki/contents/articles/52061.t-sql-bulk-insert-azure-csv-blob-into-azure-sql-database.aspx

To really amp it up with automatic import, check out this: https://marczak.io/posts/azure-loading-csv-to-sql/