Running MDS-in-a-box in Docker

I didn’t really set out to learn Docker when I started the MDS-in-a-box project, but as it turns out, Docker is quite a good fit. Part of this is because I desired to run the project in a Github Action, which is a very similar paradigm, and also because I have the notion (TBD) of running a bunch of simulations in AWS Batch. The goal of this post is to show a quick demo and then summarize what I learned – which frankly will also serve as a quick reference for me when I use Docker again.

Running the project in Docker

Once Docker Desktop is installed, building the project is trivial with two ‘make’ scripts.

make docker-build
make docker-run-superset

This takes a few minutes, but once its complete you have a full operational analytics stack running inside your machine.

The first rule of Docker

I learned this one the hard way, as I attempted to add evidence.dev to my existing container. The environment was only based on Python, and I needed to add Node support to it. I tried and tried to modify the dockerfile to get Node working – which leads to the first rule of Docker:

Thou Shalt Use An Existing Base Image

As it turns out, a quick googling revealed that there was already an awesome set of python+node base images. Shout out to this repo which is what I ended up using: Python with Node.js.

Now that I had the Docker container “working” – I needed to actually figure out which docker commands to use.

Docker Quick Reference

These are the commands that I learned and used over and over again as I triaged my way through adding another component to my environment. It is not exhaustive but designed to be a practical list of key commands to help you get started with Docker, too.

  1. docker build – use this to build the image defined in your working directory. In my project, I’m also giving it a name (-t mdsbox) and defining where to save it, so the full command is ‘docker build -t mdsbox .
  2. docker run – use this to run your image as a container once its built. You also pass in your environmental variables as part of docker run, so this command gets a bit long. Unfortunately, this is the first command that you see when learning Docker, which makes it look more imposing and scary than it actually is. The general syntax is ‘docker run <docker config> <CLI command>‘.
  3. docker ps – use this command to see which containers are running. This is so you know which containers to stop or to access (via docker exec) within the CLI.
  4. docker stop – this command stops a container. If you run a container from the terminal, you can’t stop it or exit like a process running in the terminal (i.e. with Ctrl+D), so you need to use ‘docker stop <container name>‘ instead!
  5. docker exec – this command lets you run a command on a running container. I found this be absolutely huge for debugging as you can get right into the terminal on your container and futz around with it. The command I used to access it is ‘docker exec -it <container name> /bin/bash‘ which drops you into the terminal.
  6. –publish – I’m including this Docker flag, since this is the flag you invoke to make your application visible on the network. Used in context, it looks something like this: ‘docker run –publish 3000:3000 <container name>‘. It is simply mapping port 3000 on the host to port 3000 on the container.

There are some notable exclusions, like ‘docker pull‘ but this reference is merely to help get started with MDS-in-a-box. By the way, you can check out the latest deployed version at www.mdsinabox.com!

As a note, I want to thank Pedram Navid & Greg Wilson for being my Docker shepherds – I definitely was stuck a few times and your guidance was incredibly helpful in getting things unstuck!

What Good Data Self-Serve Looks Like

I once was tasked with figuring out how to ‘democratize data’ for internal employees. No other instructions, solely a general pain point of ‘the data team is stuck doing ad-hoc tickets’ and ‘stakeholders want to get data on their own.’ After floundering for a while, I set out to figure out what data self-serve looked like at other companies. Seemed simple enough. But I quickly learned things aren’t that simple, and when are they for cross-functional data projects, anyway?

I want to share what I learned during an earnest effort to stand up data self-serve. I know others are struggling with this same project and its ambiguities and humongous scope. I spent time reading, thinking, attempting, failing, trying again, failing again, trying again and seeing success. Let’s dive in.

Executive Summary

Data rarely moves fast enough across companies to enable data-informed decisions. The data team is a bottleneck behind which many requests stack up in a queue. The data team drowns in questions and stakeholders become frustrated.  Long wait times ensue, forcing the business one of three decisions:

(1) Wait to make a decision
(2) Make a decision without data
(3) Departments hire their own data workers.

The data velocity problem is not reasonably solved through sheer volume of hiring. Data workers are expensive and hard to find. Instead, data teams typically pivot to enabling the business via self-serve and data democratization. Ideally, this unlocks the data team to focus on strategic analyses and initiatives and the business is freed to find the data they need without submitting a ticket.

Effective data teams must pivot away from (or avoid entirely!) taking tickets and into partnership with the organization, focusing on building scalable data solutions from which others can self-serve.

Data Self-Serve Definition

Data self-serve is notoriously hard to define. Any definition is inevitably specific to a company and point in time. The definition below is therefore intentionally generic:

Ability for any employee to quickly find and leverage the data and insights they need for their role without funneling through the Data Team

Good Self-Service Always Looks Different

(and bad self-service always looks the same)

Data self-serve done well, by its very nature, looks very different from company to company. The tools, processes, and organization structure supporting self-serve requires tailoring to the organization, rather than following a blueprint.

In contrast, data self-serve nearly always looks the same when done poorly. The data team is overwhelmed with tickets, business users can’t find data and strategic analyses do not happen. 

Put another way, bad data organizations tend to look the same, but working data organizations look very different from each other

This puts data teams in a liberating but uncomfortable position.  They’re able to freely define and create the self-service experience that their organization needs at this very moment, but at the same time there’s no blueprint for success.  This requires an interactive approach to find the best solution for their company.

The focus must be on the stakeholders’ needs, not the data team’s needs. Data teams have a habit of making data-self serve in their image instead of thinking about who they’re serving. The focus must be and will be squarely on the data self-serve experience of coworkers.

Typical Hurdles to Self-Serve

It’s easy to conjure up a world where someone types or speaks a question into a machine and gets the data they need. This is the ultimate data self-serve utopia and one on full display in the excellent sci-fi show The Expanse. The main characters routinely verbally ask computers questions like “Pull up every ship within X distance which was made by Y company and left Z destination in the last week.”

There are many hurdles before that sort of world is possible. Let’s cover a few of them:

Data operated as a service

Many data teams operate with a “receive a ticket, answer a ticket” mindset. This limits the team’s output into reactive short-term and smaller-scale asks and puts them as a bottleneck between the business and data. This is not an uncommon problem within the data industry. Data service teams are typically overstretched and struggle to answer all the questions coming their way. The business inevitably assumes the data team isn’t a strategic partner as they don’t seem to operate like one. If you operate like a service desk, you’ll be treated like one.

This team structure rarely scales. I recall a discussion with a C-Suite member who criticized the data team as “Getting me answers so late that I’ve forgotten my question by the time they reply.” Ouch.

In contrast, effective BI teams operate with a product-like mindset that focuses on scale and solutions. They partner closely with stakeholders to solve problems and prioritize ruthlessly based on business impact.

Data foundations not yet ready for self-serve

There is an immense amount of work required to get data ready for self-serve.  Just cleaning up a few raw data tables isn’t enough. Each part of the business (Sales, Marketing, Product, etc.) need different sets of data to answer their unique use cases.

Prepping data into the right shape requires close partnership and collaboration between the data team and its internal stakeholders.  This requires steps like data ingestion and transformation, implementing tooling like Git and dbt and having a team that can support the data lifecycle of a company.

Lack of data literacy

Data literacy, much like self-serve, is a tough term to nail down. This deserves its own discussion entirely, but for now let’s go with a typically squishy definition along the lines of “How well stakeholders can interact with and understand data.”

Training for data literacy is immensely difficult. Even if you have the world’s best data models and data marts and the Modern Data Stack™️ stakeholders will struggle to find value if they can’t grok the internal business data model or fall into common data pitfalls. This hurdle must be overcome no matter how well you do everything else.

Lack of data tools that enable self-serve

Typically there are two primary ways that an individual can self-serve data: SQL on a database or look at a data-team-created dashboard.  SQL is great for technical individuals but is not an option for the majority of employees.  Dashboards are usually widely available but lack customization.  Generally dashboards are a “you get what you get” type of experience, with little to no drill down capability and slow turnaround times from data teams for enhancement requests.

Data teams must provide other options for non-SQL savvy users to explore data in a more ad-hoc sense, leveraging curated/enriched tables created for their department. This can look as simple as providing access to enriched data in Excel or “Reverse ETL” where you send data back to source systems like Salesforce for direct consumption in those contexts.

Data privacy

This varies company to company, but data privacy comes into play depending on industry and company size. And whenever privacy is a factor, data access becomes more difficult. Typical lines in the sand are material non-public information (MNPI) when a company is publicly traded or personally identifiable information (PII) that only specific people should have access to.

Data self-serve almost always runs into data privacy concerns and the height of this hurdle (or wall…) will depend on the company.

But…What Does Good Self-Serve Look Like?!

Even though I described at length that good self-serve usually looks different, there are still some guiding principles to shoot for. These may not be universal and may change depending on your company’s data maturity, but they should be helpful.

(1) Focus initial efforts on specific departments/teams

One common bugaboo is an attempt to boil the ocean. The data team is already spread thin and pivoting everyone to self-serve for all departments at the same time will be too much. Instead, focus on a couple teams or departments with clear self-serve needs. Assign a specific analyst or two for the project who already understands a particular business domain and want to take on the challenge.

This both narrows the scope and increases the likelihood of close partnership with those teams. With partnership will come alignment on business value and understanding of pain points. Everyone wins.

(2) Create roadmap in partnership with stakeholders

Self-serve must necessarily look different from department to department. The needs for Finance are wholly different from Product, Engineering, Field, Legal or Marketing.  This is why a focus on self-serve and a dedicated BI partner is so crucial. Requirement gathering and roadmap creation must be done in close collaboration between BI and each department. Examples of requirements to gather include:

Examples of requirements to gather:

* Use cases
* Defining personas (technical/non-technical/etc)
* Tools needed
* Datasets
* Training/Enablement

(3) Build source of truth data marts

A data mart is a set of tables designed for ease of use by a department for their self-serve needs. These tables are specifically curated by the data team to make data easy to consume and understand for a particular department. 

Just providing individuals with access to the entire database is inevitably overwhelming. There could be hundreds of billions of data points across thousands of columns and hundreds to thousands of tables. Many analysts need a year to become comfortable with data at its most granular state. Expecting non-analysts to just hop in and find value isn’t reasonable.

To avoid this steep learning curve, a curated data mart enables self-service without overwhelming stakeholders.  This curated data mart must be built in close collaboration between the data partner and their stakeholders to find the sweet spot of “plenty of data” and “not confusing”.

Example: The sales team needs a few good tables such as Account, Opportunity and Task from which they can build most any report they need.

(4) Create an adoption and discoverability program

Data discoverability is an enormous challenge that must be tackled on several angles.  The existence of data marts alone is not enough to drive adoption if individuals do not know how they exist or do not know how to leverage them.

To drive adoption, efforts must include:

* Training / Onboarding sessions for all stakeholders
* Clear documentation for all data marts, tools available, key reports
* Weekly office hours
* Monthly & quarterly prioritization meetings
* Deprecation process to clean out old/unused data products

Parting Notes

There’s much more to write and I’ll follow up around defining internal stakeholder personas and choosing technologies that solve different aspects of data self-serve. For now, I hope the key message you took away from this is:

“I’m empowered to figure out how to best do self-serve at my organization.”

There’s an art to this task, and that’s why it’s so difficult to find anyone giving a blueprint. There really isn’t one. And you’ll never “arrive” at the conclusion of this project. You’ll just continually improve it, much like you do all your other data efforts. The fun is in the journey.

Passing dynamic parameters to SQL Server stored procedures with dbt

If you are using SQL Server with dbt, odds are that you probably have some stored procedures lurking in your database. And of course, the sql job agent is probably running some of those on a cron. I want to show another way to approach these, using dbt run-operations and GitHub actions. This will allow you to have a path towards moving your codebase into a VCS like git.

Unwrapping your wrapper with jinja

The pattern I am most familiar with is using the sql agent to run a “wrapper”, which servers to initialize the set of variables to pass into your stored procedure. The way I have done this with dbt is a bit different, and split into two steps: 1) writing the variables into a dbt model and 2) passing that query into a table that dbt can iterate on.

Since your model to stuff the variables into a table (step 1) is highly contextual, I’m not going to provide an example, but I will show how to pass an arbitrary sql query into a table. Example below:

{% set sql_statement %}
    SELECT * FROM {{ ref( 'my_model' ) }}
{% endset %}

{% do log(sql_statement, info=True) %}

{%- set table = run_query(sql_statement) -%}

For those of you from the SQL Server world – the metaphor here is a temporary table. You can find more about run_query here.

Agate & for loops

What we have created with the run_query macro is an Agate table. This means we can perform any of the Agate operations on this data set, which is pretty neat! In our case, we are going to use a python for loop and pass in the rows of our table.

{% for i in table.rows  -%}
    {% set stored_procs %}
        EXECUTE dbo.your_procedure
            @parameter_1 = {{ i[0] }}
            , @parameter_2 = {{ i[1] }}
    {% endset %}
    {%- do log("running query below...", info=True)  -%}
    {% do log(stored_procs, info=True) %}
    {% do run_query(stored_procs) %}
    {% set stored_procs = true %}
{% endfor %} 

The clever thing to do here with python is that we can pass multiple columns into our stored procedure, which differs from something like dbt_utils.get_column_values that can also be used as part of a for loop, but only for a single column. In this case we can reference which column to return from our table with variable[n], so i[0] returns the value in the first column in the current row, i[1] returns the second column and so on.

Building the entire macro

Now that we have the guts of this worked out, we can pull it together in an entire macro. I’m adding ‘dry_run’ flag so we can see what the generate SQL is for debugging purposes, without having to execute our procedure. As a side note, you could also build this as a macro that you run as pre or post hook, but in that case you would need to include an ‘if execute‘ block to make sure you don’t run the proc when project is compiled and so on.

-- Execute with: dbt run-operation my_macro --args '{"dry_run": True}'
-- to run the job, run w/o the args

{% macro my_macro(dry_run='false') %}
{% set sql_statement %}
    SELECT * FROM {{ ref( 'my_model' ) }}
{% endset %}

{% do log(sql_statement, info=True) %}

{%- set table = run_query(sql_statement) -%}

{% for i in table.rows  -%}
    {% set stored_procs %}
        EXECUTE dbo.your_procedure
            @parameter_1 = {{ i[0] }}
            , @parameter_2 = {{ i[1] }}
    {% endset %}
    {%- do log("running query below...", info=True)  -%}
    {% do log(stored_procs, info=True) %}
    {% if dry_run == 'false' %}
        {% do run_query(stored_procs) %}
    {% endif %}
    {% set stored_procs = true %}
{% endfor %}  
{% do log("my_macro completed.", info=True) %}
{% endmacro %}

Running in a Github action

Now that we have the macro, we can execute in dbt with ‘dbt run-operation my_macro’. Of course, this is great when testing but so no great if you want this in production. There are lots of ways you run this: on-run-start, on-run-end, as a pre or post-hook. I am not going to do that in this example, but instead share how you can run this a stand alone operation in github actions. I’ll start with the sample code.

name: run_my_proc

on:
  workflow_dispatch:
    # Inputs the workflow accepts.
    inputs:
      name:
        # Friendly description to be shown in the UI instead of 'name'
        description: 'What is the reason to trigger this manually?'
        # Default value if no value is explicitly provided
        default: 'manual run for my stored procedure'
        # Input has to be provided for the workflow to run
        required: true

env:
  DBT_PROFILES_DIR: ./
  MSSQL_USER: ${{ secrets.MSSQL_USER }}
  MSSQL_PROD: ${{ secrets.MSSQL_PROD }}
  MSSQL_LOGIN: ${{ secrets.MSSQL_LOGIN }}
   
jobs:
  run_my_proc:
    name: run_my_proc
    runs-on: self-hosted

    steps:
      - name: Check out
        uses: actions/checkout@master
      
      - name: Get dependencies # ok guess I need this anyway
        run: dbt deps --target prod

      - name: Run dbt run-operation
        run: dbt run-operation my_macro

As you can see – we are using ‘workflow_dispatch’ as our hook for the job. You can find out more about this in the github actions documentation. So now what we have in github is the ability to run this macro on demand with a button press. Neat!

Closing thoughts

One of the challenges I have experienced with existing analytics projects on SQL Server and dbt is “what do I do about my stored procedures”. They can be very hard to fit into the dbt model in my experience. So this is my attempt at a happy medium where you can continue to use those battle tested stored procedures while continuing build out and migrate towards dbt. Github actions is a simple, nicely documented way to start moving logic away from the sql job agent, and you can run it “on-prem” if you have that requirement. Of course, you can always find me on twitter @matsonj if you have questions or comments!

Three steps to handling sharded databases with dbt

A common pattern in scaling production app databases is to keep them as small as possible. Since building production apps is not my forte, I’ll lean on the commentary of experts. I like how Silvia Botros, author of High Performance MySQL, frames it below:

just keep sharding, just keep sharding…

This architecture presents a unique challenge for analytics engineering because you now have many databases with identical schemas, and dbt sources must be enumerated in your YAML files.

I am going to share the three steps that I use to solve this problem. It should be noted that if you are comfortable with jinja, I am sure there are better, more pythonic ways to solves this problem. I have landed on this solution as something that is easy to understand, fast to develop, and fast to run (i.e. performant).

Step 1: leverage YAML anchors and aliases

Anchors and Aliases are YAML constructions that allow you to reduce repeat syntax and extend existing data nodes. You can place Anchors (&) on an entity to mark a multi-line section. You can then use an Alias (*) call that anchor later in the document to reference that section.

https://www.educative.io/blog/advanced-yaml-syntax-cheatsheet

By using anchors and aliases, we can drastically cut down on the amount of duplicate code that we need to write in our YAML file. A simplified version of what I have is below.

  - name: BASE_DATABASE
    database: CUSTOMER_N
    schema: DATA
    tables: &SHARD_DATA
      - name: table_one
        identifier: name_that_makes_sense_to_eng_but_not_data
        description: a concise description
      - name: table_two

  - name: CUSTOMER_DATABASE
    database: CUSTOMER_N+1
    schema: DATA
    tables: *SHARD_DATA

Unfortunately with this solution, every time a new shard is added, we have to add a new line to our YAML file. While I don’t have a solution off hand, I am certain that you could generate this file with Python.

Step 2: Persist a list of your sharded databases

This next steps seems pretty obvious, but you need a list of your shards. There are multiple ways to get this data, but I will share two of them. The first is getting the list directly from your information schema.

(SQL SERVER)
SELECT * FROM sys.databases;

(SNOWFLAKE)
SELECT * FROM information_schema.databases

You can then persist that information in a dbt model that you can query later.

The second way is to create a dbt seed. Since I already have a manual intervention in step 1, I am ok with a little bit of extra work in managing a seed as well. This also gives me the benefit of source control so I can tell when additional shards came online. And of course, this gives a little finer control over what goes into your analytics area since you may have databases that you don’t want to include in the next step. An example seed is below.

Id,SourceName
1,BASE_DATABASE
2,CUSTOMER_DATABASE

Step 3: Use jinja + dbt_utils.get_column_values to procedurally generate your SQL

The of magic enabled by dbt here is that you can put a for loop inside your SQL query. This means that instead of writing out hundreds or thousands of lines of code to load your data into one place, dbt will instead generate it. Make sure that you have dbt_utils in your packages.yml file and that you have run ‘dbt deps’ to install it first.

{% set source_names = dbt_utils.get_column_values(table=ref('seed'), column='SourceName') %}
{% for sn in source_names %}
  SELECT field_list,
    '{{ sn }}' AS source_name
  FROM {{ source( sn , 'table_one' ) }} one
    INNER JOIN {{ ref( 'table_two' ) }} two ON one.id = two.id
  {% if not loop.last %} UNION ALL {% endif %}
{% endfor %}

In the case of our example, since we have two records in our ‘seed’ table, this will create two SQL queries with a UNION between them. Perfect!

Now I have scaled this to 25 databases or so, so managing it by hand works fine for me. Obviously if you have thousands of databases in production in this paradigm, running a giant UNION ALL may not be feasible (also I doubt you are reading this article if you have that many databases in prod). In fact, I ran into some internal constraints with parallelization with UNION with some models, so I use pre and post-hooks to handle it in a more scalable manner for those. Again, context matters here, so depending on the shape of your data, this may not work for you. Annoyingly, this doesn’t populate the dbt docs with anything particularly meaningful so you will need to keep that in mind.

(SQL SERVER)

{{ config(
    materialized = "table",
    pre_hook="
      DROP TABLE IF EXISTS #source;
      CREATE TABLE #source
      (
        some_field INT
      );

      {% set source_names = dbt_utils.get_column_values(table=ref('seed'), column='SourceName') %}
      {% for sn in source_names %}
        SELECT field_list,
          '{{ sn }}' AS source_name
        FROM {{ source( sn , 'table_one' ) }} one
          INNER JOIN {{ ref( 'table_two' ) }} two ON one.id = two.id
       {% endfor %} 
       DROP TABLE IF EXISTS target;
       SELECT * INTO target FROM #source",
    post_hook="
      DROP TABLE #source;
      DROP TABLE target;"
  )
}}    

SELECT * FROM target

So there you have it, a few ways to pull multiple tables into one with dbt. Hope you found this helpful!

Alternative methods: using dbt_utils.union_relations

In theory, using dbt_utils.union_relations can also accomplish the same as step 3, but I have not tested it that way.

Connect Snowflake to Excel in Minutes

Data “Self-Serve” is a buzzword that’s managed to stick around for a long time without a solution. However, I’m convinced that we can get partway there with simple data products rooted in familiar tools. One ubiquitious tool? Excel. Nearly everyone uses spreadsheets or similar productivity tools at work.

That leads me to meet stakeholders where they’re at: in Excel. And modern data warehouses like Snowflake make it really easy to do so. It’s an easy win if you’ve invested in Analytics Engineering to create clean datasets in your database. Let’s bring those datasets to your users.

Here’s how to connect Snowflake into Excel and enable live connections pivot tables in minutes. These are instructions for Windows specifically.

Step by Step Instructions

(1) Install the ODBC Driver

Click on the “Help” button in the Snowflake UI, go to “Download…” and select “ODBC Driver” and “Snowflake Repository”. Install from the file that downloads.

(2) Configure ODBC Driver

Go to your start menu and type in “ODBC” and click on ODBC Data Sources (64 bit)

Under User DSN, select Add…

Select SnowflakeDSIIDriver from the menu

Fill in the boxes as follows – though your individual situation may vary. My example uses SSO when an organization doesn’t allow direct usernames/passwords for Snowflake. Lots of options here and Snowflake has full documentation of options here.

Note: I found that lots of databases & schemas are available even after choosing some here. Not sure the full limitations, so you can play with options. I put all options in for the primary database I cared about and it worked fine.

Click on Test… to confirm it worked. Here’s the dialog if it did:

(3) Connect to database in Excel

Open Excel and go to the Data tab, click on Get Data and choose From Other Sources and pick From ODBC

From the window that pops up, pick the Snowflake connection and select OK

If successful, you’ll see a window with a dropdown showing your available databases. Use that dropdown to pick the database you want.

IMPORTANT: There is an easy way to load data directly into a Pivot Table at this point (thanks Jacob for this tip!) which will save you and teams time.

Once you select the database / schema / table you want, go to that “Load” button on the bottom and click the little down arrow next to it. Choose “Load to…”

The next menu that pops up will give you various options – pick the second one down saying PivotTable Report

DONE. You’re there. The data is now connected live to Snowflake and is available to pivot. I used Snowflake’s sample “Weather” table which I just learned has basically nothing in it, but that’s besides the point.

Parting notes

There are a couple interesting tidbits to pass both to your stakeholders as well as anyone concerned about Snowflake compute cost & data security.

(1) Stakeholders can refresh data live from Snowflake any time. By right-clicking the pivot table and selecting “Refresh”. No more stakeholders asking you for the latest data – they can just get it anytime.

(2) Data is cached on the local machine, reducing compute costs & keeping things snappy for stakeholders. This satisfies worries from both stakeholders on performance (it’s REALLY snappy, even for huge tables) as well as those concerned on cost (compute only happens on refresh).

That’s it! Just a few installations and clicks and you’ve connected Snowflake live into Excel for any stakeholder. Happy self-serving.

Optimizing SQL queries for speed with dbt

Like most people, I’ve been obsessed with Wordle for the past few weeks. It’s been a fun diversion and the perfect thing to do while sipping a cup of coffee.

But of course, my brain is somewhat broken by SQL and when I saw this GitHub repo courtesy of Derek Visch, I was intrigued by the idea of using SQL to build a Wordle optimizer.

Using his existing queries, I was able to get a list of “optimal” first words. But it took forever! On my laptop, over 900 seconds. Surely this thing could be optimized.

the first dbt run of the query

For reference, you can find the query here, but I’ve pulled a point in time copy below.

{{ config( tags=["old"] ) }}

WITH guesses as (
      SELECT 
            word,
            SUBSTRING(word, 1, 1) letter_one,
            SUBSTRING(word, 2, 1) letter_two,
            SUBSTRING(word, 3, 1) letter_three,
            SUBSTRING(word, 4, 1) letter_four,
            SUBSTRING(word, 5, 1) letter_five
      FROM {{ ref( 'wordle' ) }} ), 
answers as (
      select
            word,
            SUBSTRING(word, 1, 1) letter_one,
            SUBSTRING(word, 2, 1) letter_two,
            SUBSTRING(word, 3, 1) letter_three,
            SUBSTRING(word, 4, 1) letter_four,
            SUBSTRING(word, 5, 1) letter_five
      from {{ ref( 'answer' ) }} ), 
crossjoin as (
      select
            guesses.word as guess,
            answers.word as answer,
            CASE 
                  WHEN answers.letter_one in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five)   THEN 1
                  ELSE 0
            end as a1_match,
            CASE 
                  WHEN answers.letter_two in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five)   THEN 1
                  ELSE 0
            end as a2_match,
            CASE 
                  WHEN answers.letter_three in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five)   THEN 1
                  ELSE 0
            end as a3_match,
            CASE 
                  WHEN answers.letter_four in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five)   THEN 1
                  ELSE 0
            end as a4_match,
            CASE 
                  WHEN answers.letter_five in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five)   THEN 1
                  ELSE 0
            end as a5_match
      from guesses
      cross join answers), 
count_answers as (
      select 
            guess,
            answer,
            a1_match + a2_match + a3_match + a4_match + a5_match as total
      from crossjoin), 
maths_agg as (
      select
            guess,
            sum(total),
            avg(total) avg,
            stddev(total),
            max(total),
            min(total)
      from count_answers
      group by guess
      order by avg desc ), 
final as (
      select * 
      from maths_agg )

select * 
from final

The first optimization

The first, most obvious lever to pull on was to increase compute! So I switched to my newly built gaming PC. The environment setup is win 11 pro , dbt 1.0.0, and postgres 14 (via WSL2), running on an AMD 5600G processor with 32GB of RAM, although WSL2 only has access to 8GB of RAM. I will detail the environment setup in another post.

With this increased compute, I was able to reduce run time by 3.4x, from 927s to 272s.

getting faster

The second optimization

The next level was inspecting the query itself and understand where potential bottlenecks could be. There are a couple ways to do this, one of which is using the query planner. In this case, I didn’t do that because I don’t know how to use the postgresql query planner – mostly I’ve used SQL Server so I’m a bit out of my element here.

So I took each CTE apart and made them into views & tables depending complexity. Simple queries that are light on math can be materialized as views, where as more complex, math intensive queries can be materialized as tables. I leveraged the dbt config block in the specific queries I wanted to materialize as tables.

one query, now multiple models + 1 DAG

Simply by strategically using the table materialization, we can increase performance by 9.0x – 272s to 30s.

much better

The third optimization

Visually inspecting the query further, the crossjoin model is particularly nasty as a CTE.

crossjoin as (
      select
            guesses.word as guess,
            answers.word as answer,
            CASE 
                  WHEN answers.letter_one in (guesses.letter_one, guesses.letter_two, guesses.letter_three, guesses.letter_four, guesses.letter_five)   THEN 1
                  ELSE 0
            end as a1_match,

...

      from guesses
      cross join answers

First, there is a fair bit of math on each row. Secondarily, its cross joining a couple large tables and creating a 30m row model. So in round numbers, there are 5 calculations for “guess” times 5 calculations for each “answer”, for 25 calculations per row. Multiply by 25m rows, you get 750m calculations.

Now since I have a pretty robust PC with 6 cores, why not run the dbt project on 6 threads? First things first – lets change our profile to run on 6 threads.

increase thread count to 6!

With that done, I had to partition my biggest table, crossjoin, into blocks that could be processed in parallel. I did this with the following code block:

{{ config(
 tags=["new","opt"],
 materialized="table"
 ) }}

-- Since I have 6 threads, I am creating 6 partitions

SELECT 1 as partition_key, 1 as "start", MAX(id) * 0.167 as "end"
FROM {{ ref( 'guesses_with_id' ) }}
UNION ALL
SELECT 2 as partition_key, MAX(id) * 0.167+1 as "start", MAX(id) * 0.333 as "end"
FROM {{ ref( 'guesses_with_id' ) }}
UNION ALL
SELECT 3 as partition_key, MAX(id) * 0.333+1 as "start", MAX(id) * 0.5 as "end"
FROM {{ ref( 'guesses_with_id' ) }}
UNION ALL 
SELECT 4 as partition_key, MAX(id) * 0.5+1 as "start", MAX(id) * 0.667 as "end"
FROM {{ ref( 'guesses_with_id' ) }}
UNION ALL 
SELECT 5 as partition_key, MAX(id) * 0.667+1 as "start", MAX(id) *0.833 as "end"
FROM {{ ref( 'guesses_with_id' ) }}
UNION ALL 
SELECT 6 as partition_key, MAX(id) * 0.833+1 as "start", MAX(id) as "end"
FROM {{ ref( 'guesses_with_id' ) }}
dag-tastic!

Then I split my table generation query into 6 parts. I believe this could probably be done with a macro in dbt? But I am not sure, so I did this by hand.

select
guesses.word as guess,
answers.word as answer,

...

from {{ ref( 'guesses_with_id' ) }} guesses
join {{ ref( 'guess_partition' ) }} guess_partition ON partition_key = 1 
      AND guesses.id BETWEEN guess_partition.start AND guess_partition.end
cross join  {{ ref( 'answers' ) }} answers

Then of course, I need a view that sits on top of the 6 blocks and combines them into a single pane for analysis. The resulting query chain looks like this.

I then executed my new code. You can see in htop how all 6 threads are active on Postgres while these queries execute.

why shouldn’t I use all 6 cores?

This results in a run time of 17.2s, a 53.8x improvement from the original query on my laptop and a 15.8x improvement on the initial query on the faster pc. Interestingly, going from 1 thread to 6 threads only gave us a 50% performance increase, so there were bottlenecks elsewhere (Bus? Ram? I am not an expert in these things).

17 seconds! pretty good

Real world applications

This optimization, taken as a whole, worked for a few reasons:

  • It’s trivial to add more compute to a problem, although there is real hard costs incurred.
  • The postgresql query planner was particularly inefficient in handling these CTEs – most likely calculating the same data multiple times. Materializing data as a table prevents these duplicative calculations.
  • Databases are great at running queries in parallel.

These exact optimization steps won’t work for every table, especially if the calculations are not discrete on a row-by-row basis. Since each calculation in core table “crossjoin” is row-based, partitioning it into pieces that can run in parallel is very effective.

Some constraints to consider when optimizing with parallelization:

  • Read/Write throughput maximums
  • Holding the relevant data in memory
  • Compute tx per second

This scenario is purely bottlenecked on compute – so optimizing for less compute in bulk (and then secondarily, more compute in parallel) did not hit local maximums for memory and read/write speeds. As noted above, running the threads in parallel did hit a bottleneck somewhere but I am not sure where.

If you want to try this for yourself, you can find the GitHub project here. It is built for Postgres + dbt-core 1.0.0, so can’t guarantee it works in other environments.

Hat tip to Derek for sparking my curiosity and putting his code out there so that I could use it.

PS – The best two-word combo I could come up using this code is: EARLS + TONIC.

Write Code Last – 4 steps to better dashboards

I gave a talk last week about “Data to Dashboard” and I wanted to share it here, too. There is a lot of discussion in the analytics space about dashboards and how to make them look good but less about how to get to that point. This is my take on the subject – I hope you enjoy it.

Toronto Data Workshop – 6/18/2021

Start Simple With Your Analytics Project

Start Simple & Iterate

Up to this point, I’ve largely written for those looking to break into an analytics career. Today I’ll go beyond that and discuss the most powerful lesson I and many others learned — something I wish I fully understood starting out:

Start your analytics project as simple as possible and iterate from there.

This strategy borrows a lot from Agile software development not because I’m a student of it, but because I learned the values of Agile through trial and error. Only after I stumbled upon this strategy did I learn how closely it aligns to the Agile methodology.

The Common Mistake

I’m going to assume you’ve already solved the toughest issue in analytics: identifying an ambiguous problem. Congrats! Now you need to figure out how to make it happen. This is where things can go wrong.

Many analysts (myself included!) are then tempted to:

  • Retreat to your office
  • Gather & clean all the data you think everyone needs
  • Build the World’s Best V1 Dashboard
  • Schedule a meeting to present the dashboard
  • Receive unanimous praise for how amazing it is
  • Watch as everyone uses your dashboard daily

What really happens:

  • Retreat to your office
  • Gather & clean only some of the data people need
  • Spend way too long building the Dashboard No One Really Wanted
  • Stakeholders email you intermittently asking if you’re making progress
  • Schedule a meeting to present the dashboard
  • Entire meeting spent fielding questions like “Why don’t I see X or Y?”
  • Get the cold sweats realizing you don’t have what they need
  • Stakeholders frustrated that so much dev time was wasted
  • You’re frustrated that they are “changing what they need”
  • Retreat to your office

Why Does This Happen?

Every data analyst/scientist makes this mistake. It will continually happen throughout your career, even after you think you’ll never make that mistake again. No one is immune.

There is one core reason why this happens: You assume you understand what the stakeholder wants.

Except you likely don’t. Especially when you’re early in your career. You’ll think you’re on the same page with your stakeholder, but you aren’t. You think you know what data points the stakeholder needs, but you don’t (hint: the stakeholder likely doesn’t know either!). You think you know what kind of visuals the stakeholder will find most useful, but you don’t.

In fact, it’s so difficult to get everything right the first time, you should assume you don’t fully understand the request. That one time you actually do build “The World’s Best V1 Dashboard”, celebrate the unexpected success – it won’t happen often.

Strategy: Start Simple

There’s a solution to this problem: Start your analytics projects as simple as possible. This results in less wasted time in development and happier stakeholders at the end. The process looks like this:

  • Agree with stakeholder on an MVP (Minimum Viable Product) – something small that can be done quickly
    • Your stakeholder may not know exactly what they want, so you may have lots of freedom here
  • Gather & clean only the data you need for the MVP
  • Create MVP dashboard
    • Ask your stakeholder questions here, too! You don’t need to go radio silent and many times they’ll appreciate the feedback loop
  • Present MVP dashboard to stakeholder
  • Gather feedback from stakeholder
  • Start process over again

This process is designed to be quick, with small iterations should building on each other until everyone agrees the dashboard fits the needs of the business. The more interactions with stakeholders the better – you’ll quickly identify misalignments, missing data, new requirements, changing business needs and more.

The advantages should be clear. Stakeholders will feel ownership over a product they helped develop (leading to better adoption!). The end product will be closer to what the business needs (leading to better adoption!). And stakeholders will remember the success of the project and give you a call for the next one.

Conclusion

Don’t try to build Rome in a day on any analytics project. You’ll rarely succeed. Instead, iterate and build on a project until it becomes something useful – and likely looks nothing like what you thought it would starting out.

Analytics is a dynamic field. Don’t fight upstream with how quickly things change; set up your work process to allow for quick changes. Your company & future self will thank you.

4 Soft Skills to Amplify your Analytics Career

Soft Skills

I think Nate really said it best with “Technical ability is overrated.” When I’m looking to make a hire as a manager, there are four skills that I’m looking for when I’m interviewing and continually assessing for my reports.

  • Curiosity – a childlike ability to keep asking, “Why?”
  • Accuracy – balancing perfect vs. “close enough” for your data
  • Organization – ability to break down tasks into small chunks and reliability execute on them
  • Empathy – actively listening & seeking to understand, and communication centered on your audience

I’ll breakdown each of these with characteristics with an example, an interview question testing for that skill, and a way that you can improve in each of these areas.

Curiosity

The first soft-skill to have in your repertoire is Curiosity. This often means you are always asking questions and aren’t afraid of asking them. I find myself often hedging a bit in this area by saying something like, “Sorry to be dense about this, but can you explain?”

When I’m dealing with new subject areas, this often will mean pausing conversations to understand words and what they mean. “Net Sales” often means something different in the sales organization vs. the finance organization, so getting to certainty on terms is critical. Frequently, digging into these types of questions can get uncomfortable, especially if the person asking you for help doesn’t know the answer or can’t define it well.

In interviews, there are a couple of ways to get at this skill. One way is to probe about problem-solving: identifying and solving tough problems. This is a bit open-ended, so making sure to redirect the question to your underlying objective is advised. Another way would be to model the behavior and assess how the candidate handles it. A curious person should be able to match your energy and get excited at the premise of jumping down the rabbit hole on a specific subject.

Not everyone comes by this skill intuitively, and for those of you in that bucket, there are some great frameworks to unlock a curious mind. My favorite comes from Sakichi Toyoda, of Toyota fame, and is called the “5 Whys”. More on this below, from Wikipedia.

The key is to encourage the trouble-shooter to avoid assumptions and logic traps and instead trace the chain of causality in direct increments from the effect through any layers of abstraction to a root cause that still has some connection to the original problem.

https://en.wikipedia.org/wiki/Five_whys

Accuracy

Up next, we have accuracy. As a data analyst, it is critical to be right an overwhelming majority of the time. You don’t need to be perfect, and in fact, perfect is the enemy of good. This can be tricky to do well because, as an analyst, you usually are the least knowledgeable person in the domain of the problem at hand. Getting accurate goes hand-in-hand with Curiosity because you must constantly bring assumptions to light. There is a lot of digging to do.

One particularly thorny problem to deal with is sales data within a CRM. It is highly speculative. It changes frequently. In short, it is unreliable. I have found working with this type of data greatly benefits from a common snapshotting period. Just take a backup every Friday at 5 pm. Do your analysis on a static copy, and figure out how to surface changes to key data fields (like close date or opportunity size). This allows you to bring accuracy to constantly shifting data sets.

Part-and-parcel with my comment above, in interviews, I like to probe around “soft data” and see how the prospective analyst has added certainty when data is vague or unreliable. For analysts with finance experience, asking questions about how they dealt with financial periods and month-end processes since those can bottleneck key metrics (like revenue). In that same vein, asking how, in detail, key metrics were calculated will shed some light on their accuracy approach.

Part of why I favor people with accounting backgrounds coming into analytics roles is that accuracy comes with the package. This learned through long hours of grinding out Excel spreadsheets and cross footing numbers one, two, three times. Thankfully, there some tricks to getting better at Accuracy, and it comes along with Curiosity. My favorite is applying a simple checksum technique and ensuring my source & target are equal on an aggregate basis. If it’s wrong – take time to dig into why and understand why what you thought was correct isn’t. The second trick is a checklist, especially for common tasks (say, deploying code to production). Taking time to document exactly how something is done not only reduces the cognitive load for the next time but improves the quality of your work.

Organization

To me, being organized is not about a tidy desk. It is about being to tackle a problem from beginning to end. It means being able to cut through ambiguity and deliver something excellent. An underrated thing about being organized is the skill of breaking down big, hairy problems into small, actionable next steps. To me, an organized person always knows what to do next.

One example where I think Organization comes into play is the meetings you take with your stakeholders. I always try to make sure to recap actions at the end, as a habit to do every meeting that I am in. If I have actions, I’ll write them down but leave others to track their own action.

In an interview, I’ll test for this skill by probing about systems of work. “How do you organize yourself?” or “How do you know what to do next?” are questions that can get to the heart of this. For an analyst, this “system of work” is at the heart of getting more interesting work. Shipping early and often is critical to getting into the more interesting bits of work, so building your own work system is critical.

Getting better at Organization can come in many forms, but looking back on my career, this didn’t come naturally to me. I recall missing a key deadline for some sales analysis for my CSMO. He asked me why I didn’t have what he needed, and I replied, “I was busy.” His response, which put the fear of God into me, was “we are all busy” as he rolled his eyes. Needless to say, I went and picked up David Allen’s Getting Things Done. If you don’t have a work system, GTD is a great starting point, and I still use bits & pieces every day.

Empathy

You want to seek to understand and always bring empathy to the conversation with your stakeholders. They are taking the time to teach you about their business and its problems, so be a respectful skeptic. Remember, you want these folks to back to you the next time they have a problem! Empathy also means centering your communication on your stakeholders, so they feel heard even when you are giving bad news.

One thing I did as I got more comfortable in my analyst role was “rounding” with key stakeholders. This would mean making space to get coffee, drop-in late afternoon as things are winding down, and occasional lunches. These conversations were often more personal than business, but by the time conversation turned to work, we were both comfortable and ready to listen to each other. At one point, I had a couch in my office, and we joked that people would come by for therapy1.

Testing for empathy in an interview is a bit of a challenge. For me, I try to observe if they are listening versus waiting for their time to speak. I can go on for a bit too long at times, so when I catch myself wandering, I also casually check for body language in the candidate to see if they are really listening. Ultimately, this is one of the hardest skills to judge in an interview for me. But I’m actively trying to find ways to measure this quickly and accurately.

As an analyst, you are pretty sharp and usually have a good idea of how to solve a problem as soon as you hear it. To be more empathetic, slow down. Fall in love with the problem. You need to see the problem clearly enough that you can come up with a solution that exceeds your stakeholder’s expectations. Repeat back what you think you heard. This is especially important if the person across the table from you is from a different background, as cultural context can get in the way of great communication. People should always leave a meeting with you feeling like they were listened to.

Final Notes

With Curiosity, Accuracy, Organization, & Empathy, you can be a great analyst. These characteristics all build on each other and help you build a reputation as a reliable, skillful person who can deliver business value. People will seek you as the analyst to solve their problems. Yes – it’s great if you write some SQL, python, or R, but these soft skills will allow you to be 10x greater than someone much stronger technically. After all, I truly believe that success as an analyst should be measured by is how they enable the people around them. A great analyst doesn’t 10x themselves; they 2x (or more!) everyone around them.

—-

1This is REALLY HARD to do in a remote environment. No idea how to replicate this digitally but I’m sure there is a way…

New Weekly Series: Everything Analytics

Do you enjoy working with data in your current role? Are you interested in a Data Analytics career? Are you currently a Data Analyst?

Good news! This weekly series is for you. It’ll cover all sorts of topics within analytics, including advice for aspiring analysts, best practices, key skills/tools and industry updates.

Initial blog topics include:

  • The Many Wandering Paths to Analytics
  • Analytics Job/Role Types
  • Key Skill Sets for Analysts
  • Visualization Best Practices
  • Measuring Success of Analysts
  • How to Prioritize Your Work Backlog
  • …and more!

Much of this will be written from my perspective as an Analyst. There are other perspectives out there for unique positions like Data Scientists and Data Engineering, and while I’ll touch on those regularly (and will write an entire post on the difference between those roles), the focus here will be Data Analysts.

See you in a week!