analytics Archives – DataDuel.co

If spreadsheets are eternal, are BI tools transitory?

A few months ago, Benn Stancil wrote about the eternal spreadsheet. While I appreciated the generous shout out to both mdsinabox and motherduck (my employer), this really got the wheels turning around something that I have been feeling but have only been able to put into words recently: are BI tools transitory?

Consider the following scenario in the Microsoft stack: ingest data with ADF, transform with Fabric (& maybe dbt?), build a semantic model in Power BI, and delicately craft an artisanal dashboard (with your mouse). Then your stakeholder takes a look at your dashboard, navigates to the top left corner and clicks “Analyze In Excel”. How did we get here?

I remember back in the 90s, hearing my dad talk about “killer apps”. The killer app was the app that made your whole platform work. If you wanted your platform to be adopted, it needed a killer app so good that users would switch their hardware and software (an expensive proposition at the time) so they could use the killer app. In my lifetime, I recall a few killer apps: The App Store (iOS), the web (Internet), and Spreadsheets (personal computing).

Spreadsheets allowed a user to make an update in one place and for the data to flow to another place. If this concept seems oddly like a directed-acyclic graph (DAG), that’s because it is. These spreadsheets contain a bunch of features that we find handy in the data solutions stack today: snapshots (save as), version control (file naming), sharing (network shares & email attachments), business intelligence (pivot tables & charts), file interoperability (reading csv, json, xml etc), transformation (power query (there was an earlier, even more cursed version too)). All of these pieces have obvious metaphors in the commonly used data stacks today. Critically, one piece is missing: orchestration (note: back in mid 2010s, I used and loved an excel plugin called “Jet Reports” that included an orchestrator, among other things).
Now if you were running a business in the 90s (like these guys), there was no need for orchestration in your spreadsheet. You, the business user, were the orchestrator. Your data came from many places – memos (later, emails), research (books, later pdfs), a filing cabinet (later, databases), phone calls (later, slack), meetings (later, zoom calls), and your own synthesis (later, chatGPT (just kidding)). Software could not contain these! We did not have the digital twins for these analog processes. In some ways, the spreadsheet was the perfect digital collection point for these physical artifacts.

As each of these parts of our business decision making input processes transitioned to digital, our poor spreadsheet began to fall out of favor. We replaced memos with emails, phone calls with IM (via skype, if you are old enough to remember), and so on. And these digital processes began to produce loads of data. Every step produced an event that was stored in a database. The pace of change in the business environment increased in-kind. Our once per month spreadsheets orchestrated by humans were a bit too slow, processes produced too much data to be aggregated by humans. I fondly recall the launch of excel 2007, which included a new architecture and file format, so that we could process one million rows instead of only 65,536.

Unfortunately, the hardware at the time could not actually handle one million rows. Every single person using excel, unencumbered by 32bit row limits, ran into the hard limits of the Excel architecture and inevitably seeing a spinning, “waiting for excel” icon before crashing (hopefully you saved recently). Hilariously, Microsoft trained users to tolerate an absolutely terrible experience. Excel could do too much. What we needed to do was unbundle this tool, take it apart piece-by-piece, so that we could have good, delightful experiences for our business users. Users could still use spreadsheets for things, but we needed to shift the load bearing intelligence of our business decision making into better tools.

So we built even more powerful databases, and ways to automate decision making at scale. We began to use multiple computers, running in parallel, to solve these problems for us. Large complex systems like Hadoop were required to aggregate all this data. Companies like Google harnessed the immense scale enabled by these systems to become the largest in the world, building never-before-seen products and experiences.

At the same time, CPU clock speeds stopped increasing. We had maxed the number of cycles we could push out of the silicon in the fabs. But innovation found a way to continue – we began to add more cores. Slowly but surely Moore’s law kept on holding, not on clock speed but on throughput.

The software built to take advantage of the scale that was possible with huge quantities of networked computers made assumptions about how to work at great scale across many machines (i.e. Spark). These assumptions did not generalize to single machines with many cores. This has not been unnoticed, of course (see George’s tweet).

So what happened to our business intelligence while this was going on? The number of tools exploded, while the consumption interface remained unchanged. Categories were split into sub-categories into sub-categories. We only had so many charting primitives, and whether we dragged and dropped with Tableau or used BI as code in Evidence, the output looked largely the same. But instead of one tool that we needed in the 90s, we now had thousands.

But I would argue we haven’t added anything new, we’ve merely unbundled it into a bunch of different products and that don’t work that great together. REST APIs have allowed scalable, loosely coupled systems but really suck to work with. Behind every large enterprise data workflow is an SFTP server with CSVs sitting on it (if you are lucky, its object storage and a compressed format, but its the same thing).

If we look at the trends, in 5 years we will have approx. 10x more compute than we do today, and Backblaze estimates that cost per GB of storage will stabilize around 0.01 / GB ($10/TB). If these trends hold, we will easily have enough horsepower on our laptops to put all these pieces that we have decoupled over time, into one box. If BI tools are transitory, spreadsheets are eternal. The era of spreadsheets 2.0 will be upon us.

What are the characteristics of Spreadsheets 2.0?

Runs on a single node with many cores (hundreds?)
One file format for handling all types of data (xlsx++)
One language for end-to-end data manipulation (sql)
A spreadsheet UI for interacting with data at any step in the data manipulation (power query-ish)
Fast, interactive charting (mosaic)
Intelligent, incemental orchestration (dynamic dags)
An AI trained on all these parts above to assist the user in documentation, understanding, and building (clippy but good)

I believe the first building block of this has emerged in front of our eyes: DuckDB. The hardware is being built as we speak (the feedback loop will build it whether we like it or not). Julian Hyde is advocating for “completing the SQL spec to handle metrics” (with apologies to Malloy) – humans have refined this language over the last 50 years and will continue to do it for the next 50. We already have the UI primitives (Excel), so we merely need to bolt these together.

It’s time for the humble spreadsheet to RETVRN. It’s time to bring humans back into the workflow, empowered by AI, to own the data ingestion, transformation, and synthesis required to make decisions. Of course, I’m tinkering with this idea today, if you are interested in what I have so far, please reach out, I would love to talk.

LLMs in SQL? A real-world application to clean up your CRM data

Executive Summary

Firmographic data, like industry classification, employee count, and location, is notoriously messy and challenging to analyze. Modern databases offer a unique opportunity to clean and maintain this data: by leveraging an LLM, you can synthesize industry information for your accounts directly within a SQL query.

I recently took on exactly this challenge, so this isn’t an abstract solution. I’m using generic data below for data privacy reasons, but this is a real-world solution for a real-world problem I solved for a RevOps team.

h/t to Jacob Matson who demonstrated the underlying concept in MotherDuck:

Are you kidding me? Look at this: https://t.co/V8RIK3fpYe pic.twitter.com/OP13wcEiIu
— Jacob Matson (@matsonj) October 17, 2024

Problem 1: Maintaining Firmographic Data is a Challenge

One dreaded question for analysts on the sales or revenue team: “quick qq: Can I get our sales broken out by industry?” This question often signals a few late nights spent cleaning up industry data to provide a rough answer to stakeholders who, meanwhile, are wondering, “Why is this taking so long? Don’t we already pay for firmographic data?”

How often does industry data look like this? Way too often.

The never-ending hamster wheel of maintaining firmographic data is a universal challenge. There’s a reason why Dun & Bradstreet took in $2.3B in revenue in 2023 providing this data: it’s not easy to aggregate or keep up to date.

Problem 2: Overwhelming Industry Categories

There is another visible failure point in the chart above: an excessive number of industry categories.

A hidden aspect of any industry cleanup project is narrowing down to a manageable list of 10–15 industry categories. Leave the intricate sub-industries for later—they’re harder to get right, demand 10 times the maintenance effort, and are rarely used. Only address them when there’s a strong business case to do so.

Solving Problem #2, First

Before we can get to the neat LLM-in-SQL work, we need a clear list of industries to ask the LLM to leverage. Usually you want to make this list in collaboration with business, but GPT-4o can generate a solid starting point as well. Here’s the final list I landed on after refining GPT-4o’s starting point to match my preferences::

Technology
Healthcare & Life Sciences
Financial Services
Retail & Consumer Goods
Manufacturing & Industrial
Education
Public Sector & Government
Energy & Utilities
Real Estate & Construction
Transportation & Logistics
Hospitality & Recreation
Professional Services
Nonprofit & Social Impact
Media & Entertainment

Keep in mind, this isn’t a trivial task if your company hasn’t tackled it seriously before. There are a near-infinite amount of industry breakdowns out there, so you need to give purposeful thought to how to group industries based on your own customer base. Your list will likely differ from` others.

Solving Problem #1 Using an LLM in SQL

Now that we have an industry list, it’s time to ask an LLM to help. You could always upload that CSV to ChatGPT and ask it there, but that requires manual steps (download csv / upload to GPT / download csv / upload to DWH) which could get in the way of automating this in the future.

But Snowflake gives us the option to ask an LLM these questions directly in SQL, directly using context from our CRM and other data. Let’s take advantage!

For the test, I found 25 companies in my local area and wrote down only their names. Some are national brands or chains, while others are small mom-and-pop shops and everything in between. This gives me a list like the one below:

OK great! Now we need to get the syntax for querying an LLM in Snowflake. There’s plenty of documentation covering the available LLMs and functions, and the specific function we want to use is Complete. I tested all of the LLMs in Snowflake for this task, and the most consistently accurate one with reasonable processing times was llama3.1-70b, so we’ll use that LLM for now.

And here were the results. They are … uh … about what one expects on a first try. Overly verbose.

OK, time to do some “Prompt Engineering” to get this correct, because getting a paragraph for a response is incredibly not helpful.

Specifically, I added a line saying: Only return the industry in each response, I want no other text or explanations.

This works well! Very limited extra text, but it isn’t really categorizing like I want it to. For instance, I consider the After Image Styling Salon to be a Professional Service, not a Retail/Consumer Good.

The LLM would benefit from additional context for each industry for decision making. This is exactly what I would do if giving this task to someone at work. Given ambiguity, both a person and an LLM make arbitrary decisions.

Here’s the adjusted query:

Adjusted query with details for each industry included

The context helped! Notably, I found additional improvement when I described specific decision points in the text. Once I told the prompt that places like Auto Repair and Beauty were Professional Services, it correctly categorized them. Lesson being, if you are sure a specific company type should be in a specific industry, be explicit.

Here’s the new output:

Is this perfect? No. but it’s a great start given that it’s working solely off company name and several of these are small or family-owned.

Green = Match, Red = Not a Match, Yellow = Close

Overall:

16 of the 25 I consider correct or nearly correct (72%)
7 of the 25 I consider incorrect (28%)

Not a bad hit rate for a few minutes of an LLM prompt, with room to improve!

I get why names like “Taco Book” threw the LLM for a loop. That’s a local taco shop, but with the name “Book” in the title and very little online about the company, I get why the coin flip ended up on the side of Retail & Consumer Goods.

What next?

I foresee potential to use additional CRM data to make this approach even more accurate. Company descriptions, headquarters data, sales emails, etc. If it can get to ~7/10 off of name alone imagine the possible improvement with more refined prompt engineering combined with extra data context would .

For example, if you had information in your CRM about Taco Book as a restaurant, the LLM would pick up the industry rather quickly.

This feels like an application of LLMs in SQL that’s only scratching the surface of what is possible. RevOps teams everywhere may soon have a tool to update that pesky CRM data, and surely there are other use cases for this across other domains.

I’d love to hear from you if you are working on something similar!

Defining Analytics Engineering

No one knows what it means, but it’s provocative. It gets the people going!
Chazz Michael Michaels

Context: This is the first of a series of posts looking at Analytics Engineering as I’ve lived and breathed it since 2014, largely in the SaaS industry. This is not meant to provide universal truths, but rather give insight into one way to make sense of the data roles around us and how analytics engineering fits into them.

Definition and why the role matters

Analytics Engineering is the process of transforming raw, unaltered data into well-structured, governed datasets that enable meaningful analysis and measurement of operations.

This is a fairly recent title, popularized by the data transformation tool dbt in this writeup by Claire Carroll. It didn’t really take off until the advent of cloud data warehouses

At first, this role doesn’t seem necessary. Can’t people just … export data into Excel and do analysis? Haven’t data analysts been working for decades without “Analytics Engineers” existing?

Two replies:

Yeah, you can. Which works … until it doesn’t. But eventually scale becomes a problem as people waste time as Excel jockeying instead of whatever else they should be doing
Data analysts and others were already doing this work, it just didn’t have a name yet or was tool-specific (e.g., data munging in SQL Server Integration Studio)

I found myself doing analytics engineering in my very first data job 10 years ago. My title was “data analyst” and I described myself that way, but my work was curating datasets for executives and other business teams to leverage for decision making. The prior Excel-export model was running into scaling issues and I was able to save everyone time by automating data availability.

Analytics Engineering is both old and new. People have been doing the work for a long time, even if the title is relatively new.

How Analytics Engineering fits within the data ecosystem

A simplified view of the three main pillars of the data world

Analytics Engineering fits roughly between two types of data roles many are familiar with: Data Engineering and Data Analytics/Science. The far-too-simple dividing line between the roles is:

Data Engineering: Brings raw data from disparate systems into a single data warehouse

Analytics Engineering: Transforms the raw data within the warehouse into something useful by cleaning, adding business logic, etc,

Data Analytics/Science: Leverages datasets created by analytics engineers to produce analyses, dashboards, models and more for stakeholders across the organization

Reality: Things aren’t that clear cut

Life is never as nice as a graphic of an over-simplified view of the world. Analytics Engineering is no different, it’s a messy role in the middle of messy data. Here’s a closer representation to how these three categories fit together:

I could make this a lot messier. The lines between data teams are incredibly blurry.

Data engineers typically do some analytics engineering. Data analysts typically do analytics engineering. Analytics engineers do some of both.

Which makes sense! With the roles ill defined at most companies (including Analytics Engineering missing entirely!), you’ll get a mishmash of people doing a variety of these tasks. Often times, people are indirectly incentivized to blur the lines in their role to get a task or two done.

What’s next?

We’ll dive deeper into the core value prop of an analytics engineer: transforming raw, unaltered data into well-structured, governed datasets that enable meaningful analysis and measurement of operations.

Running uv in dev containers & github actions

Over the weekend I spent some time getting uv running on mdsinabox.com to see what the hubbub was about. As it turns out, it was harder than expected because of permission issues inside of dev containers & github actions.

The existing documentation on the uv github repo as well as docker instructions from ryxcommar’s blog are not pointed at my scenario, which is running it in a docker image and in CI. This blog post is up so if others run into this issue, they can find it and add it to their set up as well.

How to run uv in a dev container

Since we are using system python with uv, we need to tweak some settings in our dev container. There are two changes to make: (1) run as root user, and (2) add ““chmod 777 /tmp to your postCreateCommand. In your devcontainer.json, add or modify the following lines:

"postCreateCommand": "chmod 777 /tmp",
"remoteUser": "root"

Then you can run `uv pip install --system -r requirements.txt in your devcontainer to add libraries as needed.

How to run uv in Github Actions

Now that we are using system python in our dev container, we also need to add one step to get the perms setup in CI. And that step is to add a python setup step in the Github action before running uv.

...    
    steps:
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.11'
...

Using the “actions/setup-python@v2” github action step will set up your runtime environment to properly interact with `uv pip install --system. Shout out to Charlie, of course, who very helpfully PR’d this into the mdsinabox repo.

Hope you find this useful! Please feel free to drop me a line on twitter @matsonj if you have any comments or feedback.

Analysis of Super Bowl Squares

Analysis & Narrative Written by Jacob Matson & Matt Levine, February 2024. [pdf]

Executive Summary

The variant of “Super Bowl Squares” that we analyzed is one in which the entrant is assigned a digit (0-9) for Team A’s final score to end with and a digit for Team B’s final score to end with ¹
We compiled the final game scores from the 30 most recent NFL seasons to determine the frequency that each of the 100 potential “Squares” has been scored a winner
We then compared these frequencies with the publicly available betting odds offered on the ‘Super Bowl Squares – Final Result’ market by DraftKings Sportsbook to ascertain the expected value (EV) of each square
The analysis determined that all 100 of the available squares carried a negative expected value ranging from [-4.0% to -95.2%], and that buying all 100 squares would carry a negative expected value of approximately [-39.7%]

Our Methodology

We collected final game scores data from Pro Football Reference for the last 30 full NFL seasons, as well as the current NFL season through the completion of Week 17. We also included all Super Bowl games that took place prior to 30 seasons ago
Games that ended in a tie were excluded since that is not a potential outcome for the Super Bowl
We calculated raw frequencies for each of the 100 available squares, and then weighted the Niners’ digit 55% to the digit represented by the winner of the historical games, and 45% to the digit represented by the loser of the historical games. The [55% / 45%] weighting is reflective of the estimated win probability implied by the de-vigged Pinnacle Super Bowl Winner odds of ‘-129 / +117’ ² ³
The weighted frequencies were then multiplied by the gross payouts implied by DraftKings Sportsbook Super Bowl Squares – Final Result odds ²

Findings & Results

Raw Frequencies

Sample Size: n = 8,162 games

Most frequent digit for losing team is ‘0’, occurring ~20.5% of the time
Most frequent digit for winning team is ‘7’, occurring ~15.5% of the time

Losing Digit	Winning Digit	Frequency
7	0	3.99%
0	3	3.97%
4	7	3.47%
0	7	3.32%
0	4	3.11%

Top 5 most frequent winning squares

Weighted Frequencies

Sample Size: n = 8,162 games

Most frequent digit for Niners is ‘7’, occurring ~16.9% of the time
Most frequent digit for Chiefs is ‘0’, occurring ~17.4% of the time

Niners Digit	Chiefs Digit	Frequency
0	7	3.69%
7	0	3.62%
7	4	3.30%
3	0	3.29%
4	7	3.27%

Top 5 most frequent winning squares

Expected Value by Square⁵

Niners Digit	Chiefs Digit	Expected Value⁶
0	7	(4.04%)
3	0	(4.55%)
7	0	(5.79%)

Top 3 Best Expected Value Squares

Niners Digit	Chiefs Digit	Expected Value
2	2	(95.19%)
5	5	(77.80%)
2	5	(74.08%)

Top 3 Worst Expected Value Squares

Raw Frequencies for Total Points u47.5

Sample Size: n = 5,127 games

Most frequent digit for losing team is ‘0’, occurring ~25.2% of the time
Most frequent digit for winning team is ‘4’, occurring ~16.3% of the time

Loser Digit	Winner Digit	Frequency
0	3	5.68%
7	0	5.09%
0	7	4.76%
0	4	3.98%
7	4	3.39%

Top 5 most frequent winning squares

Raw Frequencies for Total Points o47.5

Sample Size: n = 3,035 games

Most frequent digit for losing team is ‘4’, occurring ~20.5% of the time
Most frequent digit for winning team is ‘1’, occurring ~17.9% of the time

Loser Digit	Winner Digit	Frequency
4	7	6.10%
7	1	4.09%
4	1	3.39%
1	4	2.97%
0	8	2.93%

Top 5 most frequent winning squares

Selected Conclusions

Participating in the “Super Bowl Squares – Final Result” market on DraftKings Sportsbook has a substantially negative overall expected value, and likely has a negative expected value for every single one of the 100 available squares
- This conclusion is logically continuous with the fact that the probabilities implied by DraftKings’ available odds sum to a total of ~165.9%; the market has substantial “juice” or “vig” overall
- The available odds on relatively common squares (e.g., [0:7], [3:0], [7:0]) are much closer to “fair” vs. the rarest square outcomes (e.g., [2:2], [5:5], [2:5])
  - This strategy by DraftKings entices bettors to place a substantial dollar volume of wagers on the “almost fair” squares that have a reasonable chance of winning
  - Secondarily, it mitigates the negative financial impact to DraftKings that could arise in the event of a “black swan” final game score, such as [15 – 5] or [22 – 12]
- A participant who has a bias towards a “high-scoring” vs. “low-scoring” game would place materially different value on certain square outcomes. Amongst the most pronouncedly:
- If one believes the game will be “low-scoring”, he should greatly value the losing team’s digit ‘0’, which occurs in 25.2% of low-scoring games in the dataset, but only in 12.7% of high-scoring games in the dataset
- If one believes the game will be “high-scoring”, he should greatly value the winning team’s digit ‘1’, which occurs in 17.9% of high-scoring games in the dataset, but only in 9.8% of low-scoring games in the dataset

Areas for Research Expansion

The most substantial limitation in our analysis is that the square frequencies are derived solely from historical game logs, as opposed to a Monte Carlo simulation model of this year’s Super Bowl matchup
- As such, an analyst of this data is forced to balance (i) choosing the subset of games that are most comparable to the game being predicted, and (ii) leaving a sufficiently large number of games in the dataset to mitigate the impact of outlier game results
The variant of Super Bowl Squares that we analyzed (“Final Result”) is one of several commonly played variants, each of which has its quirks that would impact the analysis. Perhaps the most common is the variant in which winning squares are determined by the digits in the score at the end of ANY quarter (as opposed to only at the end of the game)
Further analysis could yield interesting insights regarding how the value of a given square changes as the game progresses. As an example, say that a team scores a safety (worth two points) in the 1st quarter of the game. Which final square results would see the greatest increase in estimated probability? Which would see the greatest decrease? Are there any squares that would only be minimally impacted?

See ‘Appendix A’ for elaboration on the winning criteria for this variant. ↩︎
Pinnacle Super Bowl Winner odds and DraftKings Sportsbook Super Bowl Squares – Final Result odds were both updated as of approximately 9 PM EST on February 9, 2024. ↩︎
See ‘Appendix B’ for elaboration on the benefit and detailed methodology of weighting the raw square values relative to win probability. ↩︎
Pinnacle Super Bowl Winner odds and DraftKings Sportsbook Super Bowl Squares – Final Result odds were both updated as of approximately 9 PM EST on February 9, 2024. ↩︎
See ‘Appendix C’ for the DraftKings Sportsbook odds that were applied to each square in order to calculate expected value. Odds were updated as of approximately 9 PM EST on February 9, 2024. ↩︎
Parentheses reflect negative values. For example, “(5.42%)” would reflect a negative expected value of 5.42%. ↩︎

Appendix A: Winning Criteria

The variant of “Super Bowl Squares” that we analyzed is settled based on the final digit of each team’s score once the game has been completed
Both teams’ digits must match for a square to be deemed a winner. As such, there are 100 potential outcomes, and there will always be exactly 1 victorious square out of these 100 potential outcomes.
A partial set of the final scores that would result in victory for an entrant with the square “Chiefs 7 – Niners 3” are as follows:

Chiefs 7 / Niners 3	Chiefs 7 / Niners 13	Chiefs 7 / Niners 23	Chiefs 7 / Niners 33
Chiefs 17 / Niners 3	Chiefs 17 / Niners 13	Chiefs 17 / Niners 23	Chiefs 17 / Niners 33
Chiefs 27 / Niners 3	Chiefs 27 / Niners 13	Chiefs 27 / Niners 23	Chiefs 27 / Niners 33

Appendix B: Weighted Square Value

Weighting is reflective of the estimated win probability implied by the de-vigged Pinnacle Super Bowl Winner odds of ‘-129 / +117’ [55% / 45% ]

Key Insight: If the winner is known, the square “Winner 1:0 Loser” increases from 1.2% to 2.2% probability, roughly doubling.

Appendix C: DraftKings Sportsbook Available Odds

The ‘e’ in Hex is for ELT

quick note: the justification for doing this is worth like a 17 page manifesto. I’m focusing on the how, and maybe I’ll eventually write the manifesto.

General Approach

This specific problem is loading Point-of-Sale data for a vertical specific system into a database for analysis on a daily basis, but could be generalized to most small/medium data use cases where ~24 hour latency is totally fine.

The ELT pipeline uses Hex Notebooks and dbt jobs, both orchestrated independently with crons. dbt is responsible for creating all tables and handling grants as well as data transformation, while Hex handles extract and load from a set of REST APIs into the database. Hex loads into a “queue” of sorts – simply a table in Snowflake that can take JSON pages and some metadata. Conceptually, it looks like this.

Loading data with Hex

Since Hex is a python notebook running inside of managed infrastructure, we can skip the nonsense of environment management, VMs, orchestration, and so on and just get to loading data. First things first, lets add the snowflake connector to our environment.

Bash

!pip3 install snowflake-connector-python

Now that we have added that package our environment, we can build our python functions. I’ve added some simple documentation below.

Python

import requests
import os
import json
import snowflake.connector
from snowflake.connector.errors import ProgrammingError
from datetime import datetime

# login to snowflake
def snowflake_login():
    connection = snowflake.connector.connect(
        user=SNOWFLAKE_USER,
        password=SNOWFLAKE_PASSWORD,
        account=SNOWFLAKE_ACCOUNT,
        database=os.getenv('SNOWFLAKE_DATABASE'),
        schema=os.getenv('SNOWFLAKE_SCHEMA'),
        warehouse=os.getenv('SNOWFLAKE_WAREHOUSE'),
    )

    # print the database and schema
    print(f"Connected to database '{os.getenv('SNOWFLAKE_DATABASE')}' and schema '{os.getenv('SNOWFLAKE_SCHEMA')}'")

    return connection

# get the last run date for a specific endpoint and store from snowflake
def last_run_date(conn, table_name, store_name):
    cur = conn.cursor()
    try:
        # Endpoints take UTC time zone
        print(f"SELECT MAX(UPDATED_AT) FROM PROD_PREP.{table_name} WHERE store_name = '{store_name}';")
        query = f"SELECT MAX(UPDATED_AT) FROM PROD_PREP.{table_name} WHERE store_name = '{store_name}'"
        cur.execute(query)
        result = cur.fetchone()[0]
        try:
            result_date = datetime.strptime(str(result).strip("(),'"), '%Y-%m-%d %H:%M:%S').date()
        except ValueError:
            # handle the case when result is None or not in the expected format
            try:
                result_date = datetime.strptime(str(result).strip("(),'"), '%Y-%m-%d %H:%M:%S.%f').date()
            except ValueError:
                print(f"error: Cannot handle datetime format. Triggering full refresh.")
                result_date = '1900-01-01'
    except ProgrammingError as e:
        if e.errno == 2003:
            print(f'error: Table {table_name} does not exist in Snowflake. Triggering full refresh.')
            # this will trigger a full refresh if there is an error, so be careful here
            result_date = '1900-01-01'
        else:
            raise e
    cur.close()
    conn.close()
    return result_date

# Request pages, only return total page number
def get_num_pages(api_endpoint,auth_token,as_of_date):
    header = {'Authorization': auth_token}
    total_pages = requests.get(api_endpoint+'?page=1&q[updated_at_gt]='+str(as_of_date),headers=header).json()['total_pages']
    return total_pages

# Returns a specific page given a specific "as of" date and page number
def get_page(api_endpoint,auth_token,as_of_date,page_num):
    header = {'Authorization': auth_token}
    print(f"loading data from endpoint: {api_endpoint}" )
    page = requests.get(api_endpoint+'?page='+str(page_num)+'&q[updated_at_gt]='+str(as_of_date),headers=header).json()
    return page

# Loads data into snowflake
def load_to_snowflake(store_name, source_api, api_key, updated_date, total_pages, conn, stage_table, json_element):
    cur = conn.cursor()
    create_query = f"CREATE TABLE IF NOT EXISTS {stage_table} ( store_name VARCHAR , elt_date TIMESTAMPTZ, data VARIANT)"
    cur.execute(create_query)
    
    # loop through the pages
    for page_number in range(1,total_pages+1,1):
        response_json = get_page(source_api,api_key,updated_date,page_number)
        raw_json = response_json[json_element]
        raw_data = json.dumps(raw_json)
        # some fields need to be escaped for single quotes
        clean_data = raw_data.replace('\\', '\\\\').replace("'", "\\'")
        cur.execute(f"INSERT INTO {stage_table} (store_name, elt_date, data) SELECT '{store_name}', CURRENT_TIMESTAMP , PARSE_JSON('{clean_data}')")
        print(f"loaded {page_number} of {total_pages}")
    
    cur.close()
    conn.close()

# create a wrapper for previous functions so we can invoke a single statement for a given API
def job_wrapper(store_name, api_path, api_key, target_table, target_table_key):
    # get the updated date for a specific table
    updated_date = last_run_date(snowflake_login(), target_table, store_name)
    print(f"The maximum value in the 'updated_at' column of the {target_table} table is: {updated_date}")

    # get the number of pages based on the updated date
    pages = get_num_pages(api_path,api_key,updated_date)
    print(f"There are {pages} pages to load in the sales API")

    # load to snowflake
    load_to_snowflake(store_name, api_path, api_key,updated_date,pages,snowflake_login(),target_table, target_table_key)

Now that we have our python in place, we can invoke a specific API. It should be noted that Hex also has built-in environmental variable management, so we can keep our keys safe while still having a nice development & production flow.

Python

job_wrapper('store_name','api_url',AUBURN_API_KEY,'end_point_name','endpoint_unique_key')

To deploy this for more endpoints, simply update the api_url, end_point_name, and endpoint_unique_id. You can also hold it in a python dict and reference it as a variable, but I found that to be annoying when troubleshooting.

The last step in Hex is to publish the notebook so that you can set a cron job on it – I set mine to run at midnight PST.

Transforming in dbt

I am using on-run-start & on-run-end scripts in my dbt project to frame out the database, in my case, Snowflake.

SQL

on-run-start:
  - CREATE TABLE IF NOT EXISTS STAGING.sales_histories ( store_name VARCHAR , elt_date TIMESTAMPTZ, data VARIANT, id INT) ;

Now that data is in snowflake (in the RAW schema), we can use a macro in dbt to handle our transformation from pages coming from the API to rows in a database. But first we need to define our sources (the tables built in the on-run-start step) in YAML.

YAML

version: 2

sources:
  - name: SOURCE_NAME
    database: DWH_V2
    schema: STAGING
    tables:
      - name: sales_histories

Repeat for each API end point that you want to stage in your database.

Now consider the following model which transforms the JSON pages to rows:

SQL

{{ config(pre_hook="{{ merge_queues( 'sales_histories' , 'STAGING','ticketId' ) }}") }}

select 
    *,
    data:updated_at::datetime as updated_at
from {{ source( 'POSABIT', 'sales_histories' ) }}

Of course, the real magic here is in the “merge_queues” macro, which is below:

SQL

{% macro merge_queues( table_name, schema, unique_id )%}
    MERGE INTO {{schema}}.{{table_name}} t
        USING (
            with cte_top_level as (
            -- we can get some duplicate records when transaction happen as the API runs
            -- as a result, we want to take the latest date in the elt_date column
            -- this used to be a group by, and now is qualify
                select
                    store_name,
                    elt_date,
                    value as val,
                    val:{{unique_id}} as id
                from RAW.{{table_name}},
                lateral flatten( input => data )
                QUALIFY ROW_NUMBER() OVER (PARTITION BY store_name, id ORDER BY elt_date desc) = 1
            )
            select 
                *
            from cte_top_level
        ) s
        ON t.id = s.id AND t.store_name = s.store_name
        -- need to handle updates if they come in
        WHEN MATCHED THEN
            UPDATE SET t.store_name = s.store_name,
                t.elt_date = s.elt_date,
                t.data = s.val,
                t.id = s.id
        WHEN NOT MATCHED THEN
            INSERT ( store_name, elt_date, data, id)
            VALUES ( s.store_name, s.elt_date, s.val, s.id);

    -- truncate the queue
    TRUNCATE RAW.{{table_name}};
{% endmacro %}

A key note here is that snowflake does not handle MERGE like an OLTP database, so we need to de-duplicate it before we INSERT or UPDATE. I learned this the hard way by trying to de-dupe once the data was into my staging table, but annoyingly this is not easy in snowflake! So I had to truncate and try again a few times.

Now that the data is in a nice tabular format, we can run it like a typical dbt project.

Let me know if you have any questions or comments – you can find me on twitter @matsonj

Other notes

There are lots of neat features that I didn’t end up implementing. A noncomprehensive list is below:

Source control + CI/CD for the Hex notebooks – the Hex flow is so simple that I didn’t feel this was necessary.
Hex components to reduce repetition of code – today, every store gets its own notebook.
Using mdsinabox patterns with DuckDB instead of Snowflake – although part of the reason to do this was to defer infrastructure to bundled vendors.

Running MDS-in-a-box in Docker

I didn’t really set out to learn Docker when I started the MDS-in-a-box project, but as it turns out, Docker is quite a good fit. Part of this is because I desired to run the project in a Github Action, which is a very similar paradigm, and also because I have the notion (TBD) of running a bunch of simulations in AWS Batch. The goal of this post is to show a quick demo and then summarize what I learned – which frankly will also serve as a quick reference for me when I use Docker again.

Running the project in Docker

Once Docker Desktop is installed, building the project is trivial with two ‘make’ scripts.

make docker-build
make docker-run-superset

This takes a few minutes, but once its complete you have a full operational analytics stack running inside your machine.

The first rule of Docker

I learned this one the hard way, as I attempted to add evidence.dev to my existing container. The environment was only based on Python, and I needed to add Node support to it. I tried and tried to modify the dockerfile to get Node working – which leads to the first rule of Docker:

Thou Shalt Use An Existing Base Image

As it turns out, a quick googling revealed that there was already an awesome set of python+node base images. Shout out to this repo which is what I ended up using: Python with Node.js.

Now that I had the Docker container “working” – I needed to actually figure out which docker commands to use.

Docker Quick Reference

These are the commands that I learned and used over and over again as I triaged my way through adding another component to my environment. It is not exhaustive but designed to be a practical list of key commands to help you get started with Docker, too.

docker build – use this to build the image defined in your working directory. In my project, I’m also giving it a name (-t mdsbox) and defining where to save it, so the full command is ‘docker build -t mdsbox .‘
docker run – use this to run your image as a container once its built. You also pass in your environmental variables as part of docker run, so this command gets a bit long. Unfortunately, this is the first command that you see when learning Docker, which makes it look more imposing and scary than it actually is. The general syntax is ‘docker run <docker config> <CLI command>‘.
docker ps – use this command to see which containers are running. This is so you know which containers to stop or to access (via docker exec) within the CLI.
docker stop – this command stops a container. If you run a container from the terminal, you can’t stop it or exit like a process running in the terminal (i.e. with Ctrl+D), so you need to use ‘docker stop <container name>‘ instead!
docker exec – this command lets you run a command on a running container. I found this be absolutely huge for debugging as you can get right into the terminal on your container and futz around with it. The command I used to access it is ‘docker exec -it <container name> /bin/bash‘ which drops you into the terminal.
–publish – I’m including this Docker flag, since this is the flag you invoke to make your application visible on the network. Used in context, it looks something like this: ‘docker run –publish 3000:3000 <container name>‘. It is simply mapping port 3000 on the host to port 3000 on the container.

There are some notable exclusions, like ‘docker pull‘ but this reference is merely to help get started with MDS-in-a-box. By the way, you can check out the latest deployed version at www.mdsinabox.com!

As a note, I want to thank Pedram Navid & Greg Wilson for being my Docker shepherds – I definitely was stuck a few times and your guidance was incredibly helpful in getting things unstuck!

What Good Data Self-Serve Looks Like

I once was tasked with figuring out how to ‘democratize data’ for internal employees. No other instructions, solely a general pain point of ‘the data team is stuck doing ad-hoc tickets’ and ‘stakeholders want to get data on their own.’ After floundering for a while, I set out to figure out what data self-serve looked like at other companies. Seemed simple enough. But I quickly learned things aren’t that simple, and when are they for cross-functional data projects, anyway?

I want to share what I learned during an earnest effort to stand up data self-serve. I know others are struggling with this same project and its ambiguities and humongous scope. I spent time reading, thinking, attempting, failing, trying again, failing again, trying again and seeing success. Let’s dive in.

Executive Summary

Data rarely moves fast enough across companies to enable data-informed decisions. The data team is a bottleneck behind which many requests stack up in a queue. The data team drowns in questions and stakeholders become frustrated. Long wait times ensue, forcing the business one of three decisions:

(1) Wait to make a decision
(2) Make a decision without data
(3) Departments hire their own data workers.

The data velocity problem is not reasonably solved through sheer volume of hiring. Data workers are expensive and hard to find. Instead, data teams typically pivot to enabling the business via self-serve and data democratization. Ideally, this unlocks the data team to focus on strategic analyses and initiatives and the business is freed to find the data they need without submitting a ticket.

Effective data teams must pivot away from (or avoid entirely!) taking tickets and into partnership with the organization, focusing on building scalable data solutions from which others can self-serve.

Data Self-Serve Definition

Data self-serve is notoriously hard to define. Any definition is inevitably specific to a company and point in time. The definition below is therefore intentionally generic:

Ability for any employee to quickly find and leverage the data and insights they need for their role without funneling through the Data Team

Good Self-Service Always Looks Different

(and bad self-service always looks the same)

Data self-serve done well, by its very nature, looks very different from company to company. The tools, processes, and organization structure supporting self-serve requires tailoring to the organization, rather than following a blueprint.

In contrast, data self-serve nearly always looks the same when done poorly. The data team is overwhelmed with tickets, business users can’t find data and strategic analyses do not happen.

Put another way, bad data organizations tend to look the same, but working data organizations look very different from each other.

This puts data teams in a liberating but uncomfortable position. They’re able to freely define and create the self-service experience that their organization needs at this very moment, but at the same time there’s no blueprint for success. This requires an interactive approach to find the best solution for their company.

The focus must be on the stakeholders’ needs, not the data team’s needs. Data teams have a habit of making data-self serve in their image instead of thinking about who they’re serving. The focus must be and will be squarely on the data self-serve experience of coworkers.

Typical Hurdles to Self-Serve

It’s easy to conjure up a world where someone types or speaks a question into a machine and gets the data they need. This is the ultimate data self-serve utopia and one on full display in the excellent sci-fi show The Expanse. The main characters routinely verbally ask computers questions like “Pull up every ship within X distance which was made by Y company and left Z destination in the last week.”

There are many hurdles before that sort of world is possible. Let’s cover a few of them:

Data operated as a service

Many data teams operate with a “receive a ticket, answer a ticket” mindset. This limits the team’s output into reactive short-term and smaller-scale asks and puts them as a bottleneck between the business and data. This is not an uncommon problem within the data industry. Data service teams are typically overstretched and struggle to answer all the questions coming their way. The business inevitably assumes the data team isn’t a strategic partner as they don’t seem to operate like one. If you operate like a service desk, you’ll be treated like one.

This team structure rarely scales. I recall a discussion with a C-Suite member who criticized the data team as “Getting me answers so late that I’ve forgotten my question by the time they reply.” Ouch.

In contrast, effective BI teams operate with a product-like mindset that focuses on scale and solutions. They partner closely with stakeholders to solve problems and prioritize ruthlessly based on business impact.

Data foundations not yet ready for self-serve

There is an immense amount of work required to get data ready for self-serve. Just cleaning up a few raw data tables isn’t enough. Each part of the business (Sales, Marketing, Product, etc.) need different sets of data to answer their unique use cases.

Prepping data into the right shape requires close partnership and collaboration between the data team and its internal stakeholders. This requires steps like data ingestion and transformation, implementing tooling like Git and dbt and having a team that can support the data lifecycle of a company.

Lack of data literacy

Data literacy, much like self-serve, is a tough term to nail down. This deserves its own discussion entirely, but for now let’s go with a typically squishy definition along the lines of “How well stakeholders can interact with and understand data.”

Training for data literacy is immensely difficult. Even if you have the world’s best data models and data marts and the Modern Data Stack™️ stakeholders will struggle to find value if they can’t grok the internal business data model or fall into common data pitfalls. This hurdle must be overcome no matter how well you do everything else.

Lack of data tools that enable self-serve

Typically there are two primary ways that an individual can self-serve data: SQL on a database or look at a data-team-created dashboard. SQL is great for technical individuals but is not an option for the majority of employees. Dashboards are usually widely available but lack customization. Generally dashboards are a “you get what you get” type of experience, with little to no drill down capability and slow turnaround times from data teams for enhancement requests.

Data teams must provide other options for non-SQL savvy users to explore data in a more ad-hoc sense, leveraging curated/enriched tables created for their department. This can look as simple as providing access to enriched data in Excel or “Reverse ETL” where you send data back to source systems like Salesforce for direct consumption in those contexts.

Data privacy

This varies company to company, but data privacy comes into play depending on industry and company size. And whenever privacy is a factor, data access becomes more difficult. Typical lines in the sand are material non-public information (MNPI) when a company is publicly traded or personally identifiable information (PII) that only specific people should have access to.

Data self-serve almost always runs into data privacy concerns and the height of this hurdle (or wall…) will depend on the company.

But…What Does Good Self-Serve Look Like?!

Even though I described at length that good self-serve usually looks different, there are still some guiding principles to shoot for. These may not be universal and may change depending on your company’s data maturity, but they should be helpful.

(1) Focus initial efforts on specific departments/teams

One common bugaboo is an attempt to boil the ocean. The data team is already spread thin and pivoting everyone to self-serve for all departments at the same time will be too much. Instead, focus on a couple teams or departments with clear self-serve needs. Assign a specific analyst or two for the project who already understands a particular business domain and want to take on the challenge.

This both narrows the scope and increases the likelihood of close partnership with those teams. With partnership will come alignment on business value and understanding of pain points. Everyone wins.

(2) Create roadmap in partnership with stakeholders

Self-serve must necessarily look different from department to department. The needs for Finance are wholly different from Product, Engineering, Field, Legal or Marketing. This is why a focus on self-serve and a dedicated BI partner is so crucial. Requirement gathering and roadmap creation must be done in close collaboration between BI and each department. Examples of requirements to gather include:

Examples of requirements to gather:

* Use cases
* Defining personas (technical/non-technical/etc)
* Tools needed
* Datasets
* Training/Enablement

(3) Build source of truth data marts

A data mart is a set of tables designed for ease of use by a department for their self-serve needs. These tables are specifically curated by the data team to make data easy to consume and understand for a particular department.

Just providing individuals with access to the entire database is inevitably overwhelming. There could be hundreds of billions of data points across thousands of columns and hundreds to thousands of tables. Many analysts need a year to become comfortable with data at its most granular state. Expecting non-analysts to just hop in and find value isn’t reasonable.

To avoid this steep learning curve, a curated data mart enables self-service without overwhelming stakeholders. This curated data mart must be built in close collaboration between the data partner and their stakeholders to find the sweet spot of “plenty of data” and “not confusing”.

Example: The sales team needs a few good tables such as Account, Opportunity and Task from which they can build most any report they need.

(4) Create an adoption and discoverability program

Data discoverability is an enormous challenge that must be tackled on several angles. The existence of data marts alone is not enough to drive adoption if individuals do not know how they exist or do not know how to leverage them.

To drive adoption, efforts must include:

* Training / Onboarding sessions for all stakeholders
* Clear documentation for all data marts, tools available, key reports
* Weekly office hours
* Monthly & quarterly prioritization meetings
* Deprecation process to clean out old/unused data products

Parting Notes

There’s much more to write and I’ll follow up around defining internal stakeholder personas and choosing technologies that solve different aspects of data self-serve. For now, I hope the key message you took away from this is:

“I’m empowered to figure out how to best do self-serve at my organization.”

There’s an art to this task, and that’s why it’s so difficult to find anyone giving a blueprint. There really isn’t one. And you’ll never “arrive” at the conclusion of this project. You’ll just continually improve it, much like you do all your other data efforts. The fun is in the journey.

Passing dynamic parameters to SQL Server stored procedures with dbt

If you are using SQL Server with dbt, odds are that you probably have some stored procedures lurking in your database. And of course, the sql job agent is probably running some of those on a cron. I want to show another way to approach these, using dbt run-operations and GitHub actions. This will allow you to have a path towards moving your codebase into a VCS like git.

Unwrapping your wrapper with jinja

The pattern I am most familiar with is using the sql agent to run a “wrapper”, which servers to initialize the set of variables to pass into your stored procedure. The way I have done this with dbt is a bit different, and split into two steps: 1) writing the variables into a dbt model and 2) passing that query into a table that dbt can iterate on.

Since your model to stuff the variables into a table (step 1) is highly contextual, I’m not going to provide an example, but I will show how to pass an arbitrary sql query into a table. Example below:

{% set sql_statement %}
    SELECT * FROM {{ ref( 'my_model' ) }}
{% endset %}

{% do log(sql_statement, info=True) %}

{%- set table = run_query(sql_statement) -%}

For those of you from the SQL Server world – the metaphor here is a temporary table. You can find more about run_query here.

Agate & for loops

What we have created with the run_query macro is an Agate table. This means we can perform any of the Agate operations on this data set, which is pretty neat! In our case, we are going to use a python for loop and pass in the rows of our table.

{% for i in table.rows  -%}
    {% set stored_procs %}
        EXECUTE dbo.your_procedure
            @parameter_1 = {{ i[0] }}
            , @parameter_2 = {{ i[1] }}
    {% endset %}
    {%- do log("running query below...", info=True)  -%}
    {% do log(stored_procs, info=True) %}
    {% do run_query(stored_procs) %}
    {% set stored_procs = true %}
{% endfor %}

The clever thing to do here with python is that we can pass multiple columns into our stored procedure, which differs from something like dbt_utils.get_column_values that can also be used as part of a for loop, but only for a single column. In this case we can reference which column to return from our table with variable[n], so i[0] returns the value in the first column in the current row, i[1] returns the second column and so on.

Building the entire macro

Now that we have the guts of this worked out, we can pull it together in an entire macro. I’m adding ‘dry_run’ flag so we can see what the generate SQL is for debugging purposes, without having to execute our procedure. As a side note, you could also build this as a macro that you run as pre or post hook, but in that case you would need to include an ‘if execute‘ block to make sure you don’t run the proc when project is compiled and so on.

-- Execute with: dbt run-operation my_macro --args '{"dry_run": True}'
-- to run the job, run w/o the args

{% macro my_macro(dry_run='false') %}
{% set sql_statement %}
    SELECT * FROM {{ ref( 'my_model' ) }}
{% endset %}

{% do log(sql_statement, info=True) %}

{%- set table = run_query(sql_statement) -%}

{% for i in table.rows  -%}
    {% set stored_procs %}
        EXECUTE dbo.your_procedure
            @parameter_1 = {{ i[0] }}
            , @parameter_2 = {{ i[1] }}
    {% endset %}
    {%- do log("running query below...", info=True)  -%}
    {% do log(stored_procs, info=True) %}
    {% if dry_run == 'false' %}
        {% do run_query(stored_procs) %}
    {% endif %}
    {% set stored_procs = true %}
{% endfor %}  
{% do log("my_macro completed.", info=True) %}
{% endmacro %}

Running in a Github action

Now that we have the macro, we can execute in dbt with ‘dbt run-operation my_macro’. Of course, this is great when testing but so no great if you want this in production. There are lots of ways you run this: on-run-start, on-run-end, as a pre or post-hook. I am not going to do that in this example, but instead share how you can run this a stand alone operation in github actions. I’ll start with the sample code.

name: run_my_proc

on:
  workflow_dispatch:
    # Inputs the workflow accepts.
    inputs:
      name:
        # Friendly description to be shown in the UI instead of 'name'
        description: 'What is the reason to trigger this manually?'
        # Default value if no value is explicitly provided
        default: 'manual run for my stored procedure'
        # Input has to be provided for the workflow to run
        required: true

env:
  DBT_PROFILES_DIR: ./
  MSSQL_USER: ${{ secrets.MSSQL_USER }}
  MSSQL_PROD: ${{ secrets.MSSQL_PROD }}
  MSSQL_LOGIN: ${{ secrets.MSSQL_LOGIN }}
   
jobs:
  run_my_proc:
    name: run_my_proc
    runs-on: self-hosted

    steps:
      - name: Check out
        uses: actions/checkout@master
      
      - name: Get dependencies # ok guess I need this anyway
        run: dbt deps --target prod

      - name: Run dbt run-operation
        run: dbt run-operation my_macro

As you can see – we are using ‘workflow_dispatch’ as our hook for the job. You can find out more about this in the github actions documentation. So now what we have in github is the ability to run this macro on demand with a button press. Neat!

Closing thoughts

One of the challenges I have experienced with existing analytics projects on SQL Server and dbt is “what do I do about my stored procedures”. They can be very hard to fit into the dbt model in my experience. So this is my attempt at a happy medium where you can continue to use those battle tested stored procedures while continuing build out and migrate towards dbt. Github actions is a simple, nicely documented way to start moving logic away from the sql job agent, and you can run it “on-prem” if you have that requirement. Of course, you can always find me on twitter @matsonj if you have questions or comments!

Three steps to handling sharded databases with dbt

A common pattern in scaling production app databases is to keep them as small as possible. Since building production apps is not my forte, I’ll lean on the commentary of experts. I like how Silvia Botros, author of High Performance MySQL, frames it below:

https://twitter.com/dbsmasher/status/1520124723870375937

just keep sharding, just keep sharding…

This architecture presents a unique challenge for analytics engineering because you now have many databases with identical schemas, and dbt sources must be enumerated in your YAML files.

I am going to share the three steps that I use to solve this problem. It should be noted that if you are comfortable with jinja, I am sure there are better, more pythonic ways to solves this problem. I have landed on this solution as something that is easy to understand, fast to develop, and fast to run (i.e. performant).

Step 1: leverage YAML anchors and aliases

Anchors and Aliases are YAML constructions that allow you to reduce repeat syntax and extend existing data nodes. You can place Anchors (&) on an entity to mark a multi-line section. You can then use an Alias (*) call that anchor later in the document to reference that section.
https://www.educative.io/blog/advanced-yaml-syntax-cheatsheet

By using anchors and aliases, we can drastically cut down on the amount of duplicate code that we need to write in our YAML file. A simplified version of what I have is below.

  - name: BASE_DATABASE
    database: CUSTOMER_N
    schema: DATA
    tables: &SHARD_DATA
      - name: table_one
        identifier: name_that_makes_sense_to_eng_but_not_data
        description: a concise description
      - name: table_two

  - name: CUSTOMER_DATABASE
    database: CUSTOMER_N+1
    schema: DATA
    tables: *SHARD_DATA

Unfortunately with this solution, every time a new shard is added, we have to add a new line to our YAML file. While I don’t have a solution off hand, I am certain that you could generate this file with Python.

Step 2: Persist a list of your sharded databases

This next steps seems pretty obvious, but you need a list of your shards. There are multiple ways to get this data, but I will share two of them. The first is getting the list directly from your information schema.

(SQL SERVER)
SELECT * FROM sys.databases;

(SNOWFLAKE)
SELECT * FROM information_schema.databases

You can then persist that information in a dbt model that you can query later.

The second way is to create a dbt seed. Since I already have a manual intervention in step 1, I am ok with a little bit of extra work in managing a seed as well. This also gives me the benefit of source control so I can tell when additional shards came online. And of course, this gives a little finer control over what goes into your analytics area since you may have databases that you don’t want to include in the next step. An example seed is below.

Id,SourceName
1,BASE_DATABASE
2,CUSTOMER_DATABASE

Step 3: Use jinja + dbt_utils.get_column_values to procedurally generate your SQL

The of magic enabled by dbt here is that you can put a for loop inside your SQL query. This means that instead of writing out hundreds or thousands of lines of code to load your data into one place, dbt will instead generate it. Make sure that you have dbt_utils in your packages.yml file and that you have run ‘dbt deps’ to install it first.

{% set source_names = dbt_utils.get_column_values(table=ref('seed'), column='SourceName') %}
{% for sn in source_names %}
  SELECT field_list,
    '{{ sn }}' AS source_name
  FROM {{ source( sn , 'table_one' ) }} one
    INNER JOIN {{ ref( 'table_two' ) }} two ON one.id = two.id
  {% if not loop.last %} UNION ALL {% endif %}
{% endfor %}

In the case of our example, since we have two records in our ‘seed’ table, this will create two SQL queries with a UNION between them. Perfect!

Now I have scaled this to 25 databases or so, so managing it by hand works fine for me. Obviously if you have thousands of databases in production in this paradigm, running a giant UNION ALL may not be feasible (also I doubt you are reading this article if you have that many databases in prod). In fact, I ran into some internal constraints with parallelization with UNION with some models, so I use pre and post-hooks to handle it in a more scalable manner for those. Again, context matters here, so depending on the shape of your data, this may not work for you. Annoyingly, this doesn’t populate the dbt docs with anything particularly meaningful so you will need to keep that in mind.

(SQL SERVER)

{{ config(
    materialized = "table",
    pre_hook="
      DROP TABLE IF EXISTS #source;
      CREATE TABLE #source
      (
        some_field INT
      );

      {% set source_names = dbt_utils.get_column_values(table=ref('seed'), column='SourceName') %}
      {% for sn in source_names %}
        SELECT field_list,
          '{{ sn }}' AS source_name
        FROM {{ source( sn , 'table_one' ) }} one
          INNER JOIN {{ ref( 'table_two' ) }} two ON one.id = two.id
       {% endfor %} 
       DROP TABLE IF EXISTS target;
       SELECT * INTO target FROM #source",
    post_hook="
      DROP TABLE #source;
      DROP TABLE target;"
  )
}}    

SELECT * FROM target

So there you have it, a few ways to pull multiple tables into one with dbt. Hope you found this helpful!

Alternative methods: using dbt_utils.union_relations

In theory, using dbt_utils.union_relations can also accomplish the same as step 3, but I have not tested it that way.

Executive Summary

Problem 1: Maintaining Firmographic Data is a Challenge

Problem 2: Overwhelming Industry Categories

Solving Problem #2, First

Solving Problem #1 Using an LLM in SQL

What next?

Definition and why the role matters

How Analytics Engineering fits within the data ecosystem

Reality: Things aren’t that clear cut

What’s next?

How to run uv in a dev container

How to run uv in Github Actions

Executive Summary

Our Methodology

Findings & Results

Raw Frequencies

Weighted Frequencies

Expected Value by Square5

Raw Frequencies for Total Points u47.5

Raw Frequencies for Total Points o47.5

Selected Conclusions

Areas for Research Expansion

Appendix A: Winning Criteria

Appendix B: Weighted Square Value

Appendix C: DraftKings Sportsbook Available Odds

General Approach

Loading data with Hex

Transforming in dbt

Other notes

Running the project in Docker

The first rule of Docker

Docker Quick Reference

Executive Summary

Data Self-Serve Definition

Good Self-Service Always Looks Different

Typical Hurdles to Self-Serve

Data operated as a service

Data foundations not yet ready for self-serve

Lack of data literacy

Lack of data tools that enable self-serve

Data privacy

But…What Does Good Self-Serve Look Like?!

(1) Focus initial efforts on specific departments/teams

(2) Create roadmap in partnership with stakeholders

(3) Build source of truth data marts

(4) Create an adoption and discoverability program

Parting Notes

Unwrapping your wrapper with jinja

Agate & for loops

Building the entire macro

Running in a Github action

Closing thoughts

Step 1: leverage YAML anchors and aliases

Step 2: Persist a list of your sharded databases

Step 3: Use jinja + dbt_utils.get_column_values to procedurally generate your SQL

Alternative methods: using dbt_utils.union_relations

Expected Value by Square⁵