Nate Sooter, Author at DataDuel.co

LLMs in SQL? A real-world application to clean up your CRM data

Executive Summary

Firmographic data, like industry classification, employee count, and location, is notoriously messy and challenging to analyze. Modern databases offer a unique opportunity to clean and maintain this data: by leveraging an LLM, you can synthesize industry information for your accounts directly within a SQL query.

I recently took on exactly this challenge, so this isn’t an abstract solution. I’m using generic data below for data privacy reasons, but this is a real-world solution for a real-world problem I solved for a RevOps team.

h/t to Jacob Matson who demonstrated the underlying concept in MotherDuck:

Are you kidding me? Look at this: https://t.co/V8RIK3fpYe pic.twitter.com/OP13wcEiIu
— Jacob Matson (@matsonj) October 17, 2024

Problem 1: Maintaining Firmographic Data is a Challenge

One dreaded question for analysts on the sales or revenue team: “quick qq: Can I get our sales broken out by industry?” This question often signals a few late nights spent cleaning up industry data to provide a rough answer to stakeholders who, meanwhile, are wondering, “Why is this taking so long? Don’t we already pay for firmographic data?”

How often does industry data look like this? Way too often.

The never-ending hamster wheel of maintaining firmographic data is a universal challenge. There’s a reason why Dun & Bradstreet took in $2.3B in revenue in 2023 providing this data: it’s not easy to aggregate or keep up to date.

Problem 2: Overwhelming Industry Categories

There is another visible failure point in the chart above: an excessive number of industry categories.

A hidden aspect of any industry cleanup project is narrowing down to a manageable list of 10–15 industry categories. Leave the intricate sub-industries for later—they’re harder to get right, demand 10 times the maintenance effort, and are rarely used. Only address them when there’s a strong business case to do so.

Solving Problem #2, First

Before we can get to the neat LLM-in-SQL work, we need a clear list of industries to ask the LLM to leverage. Usually you want to make this list in collaboration with business, but GPT-4o can generate a solid starting point as well. Here’s the final list I landed on after refining GPT-4o’s starting point to match my preferences::

Technology
Healthcare & Life Sciences
Financial Services
Retail & Consumer Goods
Manufacturing & Industrial
Education
Public Sector & Government
Energy & Utilities
Real Estate & Construction
Transportation & Logistics
Hospitality & Recreation
Professional Services
Nonprofit & Social Impact
Media & Entertainment

Keep in mind, this isn’t a trivial task if your company hasn’t tackled it seriously before. There are a near-infinite amount of industry breakdowns out there, so you need to give purposeful thought to how to group industries based on your own customer base. Your list will likely differ from` others.

Solving Problem #1 Using an LLM in SQL

Now that we have an industry list, it’s time to ask an LLM to help. You could always upload that CSV to ChatGPT and ask it there, but that requires manual steps (download csv / upload to GPT / download csv / upload to DWH) which could get in the way of automating this in the future.

But Snowflake gives us the option to ask an LLM these questions directly in SQL, directly using context from our CRM and other data. Let’s take advantage!

For the test, I found 25 companies in my local area and wrote down only their names. Some are national brands or chains, while others are small mom-and-pop shops and everything in between. This gives me a list like the one below:

OK great! Now we need to get the syntax for querying an LLM in Snowflake. There’s plenty of documentation covering the available LLMs and functions, and the specific function we want to use is Complete. I tested all of the LLMs in Snowflake for this task, and the most consistently accurate one with reasonable processing times was llama3.1-70b, so we’ll use that LLM for now.

And here were the results. They are … uh … about what one expects on a first try. Overly verbose.

OK, time to do some “Prompt Engineering” to get this correct, because getting a paragraph for a response is incredibly not helpful.

Specifically, I added a line saying: Only return the industry in each response, I want no other text or explanations.

This works well! Very limited extra text, but it isn’t really categorizing like I want it to. For instance, I consider the After Image Styling Salon to be a Professional Service, not a Retail/Consumer Good.

The LLM would benefit from additional context for each industry for decision making. This is exactly what I would do if giving this task to someone at work. Given ambiguity, both a person and an LLM make arbitrary decisions.

Here’s the adjusted query:

Adjusted query with details for each industry included

The context helped! Notably, I found additional improvement when I described specific decision points in the text. Once I told the prompt that places like Auto Repair and Beauty were Professional Services, it correctly categorized them. Lesson being, if you are sure a specific company type should be in a specific industry, be explicit.

Here’s the new output:

Is this perfect? No. but it’s a great start given that it’s working solely off company name and several of these are small or family-owned.

Green = Match, Red = Not a Match, Yellow = Close

Overall:

16 of the 25 I consider correct or nearly correct (72%)
7 of the 25 I consider incorrect (28%)

Not a bad hit rate for a few minutes of an LLM prompt, with room to improve!

I get why names like “Taco Book” threw the LLM for a loop. That’s a local taco shop, but with the name “Book” in the title and very little online about the company, I get why the coin flip ended up on the side of Retail & Consumer Goods.

What next?

I foresee potential to use additional CRM data to make this approach even more accurate. Company descriptions, headquarters data, sales emails, etc. If it can get to ~7/10 off of name alone imagine the possible improvement with more refined prompt engineering combined with extra data context would .

For example, if you had information in your CRM about Taco Book as a restaurant, the LLM would pick up the industry rather quickly.

This feels like an application of LLMs in SQL that’s only scratching the surface of what is possible. RevOps teams everywhere may soon have a tool to update that pesky CRM data, and surely there are other use cases for this across other domains.

I’d love to hear from you if you are working on something similar!

Defining Analytics Engineering

No one knows what it means, but it’s provocative. It gets the people going!
Chazz Michael Michaels

Context: This is the first of a series of posts looking at Analytics Engineering as I’ve lived and breathed it since 2014, largely in the SaaS industry. This is not meant to provide universal truths, but rather give insight into one way to make sense of the data roles around us and how analytics engineering fits into them.

Definition and why the role matters

Analytics Engineering is the process of transforming raw, unaltered data into well-structured, governed datasets that enable meaningful analysis and measurement of operations.

This is a fairly recent title, popularized by the data transformation tool dbt in this writeup by Claire Carroll. It didn’t really take off until the advent of cloud data warehouses

At first, this role doesn’t seem necessary. Can’t people just … export data into Excel and do analysis? Haven’t data analysts been working for decades without “Analytics Engineers” existing?

Two replies:

Yeah, you can. Which works … until it doesn’t. But eventually scale becomes a problem as people waste time as Excel jockeying instead of whatever else they should be doing
Data analysts and others were already doing this work, it just didn’t have a name yet or was tool-specific (e.g., data munging in SQL Server Integration Studio)

I found myself doing analytics engineering in my very first data job 10 years ago. My title was “data analyst” and I described myself that way, but my work was curating datasets for executives and other business teams to leverage for decision making. The prior Excel-export model was running into scaling issues and I was able to save everyone time by automating data availability.

Analytics Engineering is both old and new. People have been doing the work for a long time, even if the title is relatively new.

How Analytics Engineering fits within the data ecosystem

A simplified view of the three main pillars of the data world

Analytics Engineering fits roughly between two types of data roles many are familiar with: Data Engineering and Data Analytics/Science. The far-too-simple dividing line between the roles is:

Data Engineering: Brings raw data from disparate systems into a single data warehouse

Analytics Engineering: Transforms the raw data within the warehouse into something useful by cleaning, adding business logic, etc,

Data Analytics/Science: Leverages datasets created by analytics engineers to produce analyses, dashboards, models and more for stakeholders across the organization

Reality: Things aren’t that clear cut

Life is never as nice as a graphic of an over-simplified view of the world. Analytics Engineering is no different, it’s a messy role in the middle of messy data. Here’s a closer representation to how these three categories fit together:

I could make this a lot messier. The lines between data teams are incredibly blurry.

Data engineers typically do some analytics engineering. Data analysts typically do analytics engineering. Analytics engineers do some of both.

Which makes sense! With the roles ill defined at most companies (including Analytics Engineering missing entirely!), you’ll get a mishmash of people doing a variety of these tasks. Often times, people are indirectly incentivized to blur the lines in their role to get a task or two done.

What’s next?

We’ll dive deeper into the core value prop of an analytics engineer: transforming raw, unaltered data into well-structured, governed datasets that enable meaningful analysis and measurement of operations.

What Good Data Self-Serve Looks Like

I once was tasked with figuring out how to ‘democratize data’ for internal employees. No other instructions, solely a general pain point of ‘the data team is stuck doing ad-hoc tickets’ and ‘stakeholders want to get data on their own.’ After floundering for a while, I set out to figure out what data self-serve looked like at other companies. Seemed simple enough. But I quickly learned things aren’t that simple, and when are they for cross-functional data projects, anyway?

I want to share what I learned during an earnest effort to stand up data self-serve. I know others are struggling with this same project and its ambiguities and humongous scope. I spent time reading, thinking, attempting, failing, trying again, failing again, trying again and seeing success. Let’s dive in.

Executive Summary

Data rarely moves fast enough across companies to enable data-informed decisions. The data team is a bottleneck behind which many requests stack up in a queue. The data team drowns in questions and stakeholders become frustrated. Long wait times ensue, forcing the business one of three decisions:

(1) Wait to make a decision
(2) Make a decision without data
(3) Departments hire their own data workers.

The data velocity problem is not reasonably solved through sheer volume of hiring. Data workers are expensive and hard to find. Instead, data teams typically pivot to enabling the business via self-serve and data democratization. Ideally, this unlocks the data team to focus on strategic analyses and initiatives and the business is freed to find the data they need without submitting a ticket.

Effective data teams must pivot away from (or avoid entirely!) taking tickets and into partnership with the organization, focusing on building scalable data solutions from which others can self-serve.

Data Self-Serve Definition

Data self-serve is notoriously hard to define. Any definition is inevitably specific to a company and point in time. The definition below is therefore intentionally generic:

Ability for any employee to quickly find and leverage the data and insights they need for their role without funneling through the Data Team

Good Self-Service Always Looks Different

(and bad self-service always looks the same)

Data self-serve done well, by its very nature, looks very different from company to company. The tools, processes, and organization structure supporting self-serve requires tailoring to the organization, rather than following a blueprint.

In contrast, data self-serve nearly always looks the same when done poorly. The data team is overwhelmed with tickets, business users can’t find data and strategic analyses do not happen.

Put another way, bad data organizations tend to look the same, but working data organizations look very different from each other.

This puts data teams in a liberating but uncomfortable position. They’re able to freely define and create the self-service experience that their organization needs at this very moment, but at the same time there’s no blueprint for success. This requires an interactive approach to find the best solution for their company.

The focus must be on the stakeholders’ needs, not the data team’s needs. Data teams have a habit of making data-self serve in their image instead of thinking about who they’re serving. The focus must be and will be squarely on the data self-serve experience of coworkers.

Typical Hurdles to Self-Serve

It’s easy to conjure up a world where someone types or speaks a question into a machine and gets the data they need. This is the ultimate data self-serve utopia and one on full display in the excellent sci-fi show The Expanse. The main characters routinely verbally ask computers questions like “Pull up every ship within X distance which was made by Y company and left Z destination in the last week.”

There are many hurdles before that sort of world is possible. Let’s cover a few of them:

Data operated as a service

Many data teams operate with a “receive a ticket, answer a ticket” mindset. This limits the team’s output into reactive short-term and smaller-scale asks and puts them as a bottleneck between the business and data. This is not an uncommon problem within the data industry. Data service teams are typically overstretched and struggle to answer all the questions coming their way. The business inevitably assumes the data team isn’t a strategic partner as they don’t seem to operate like one. If you operate like a service desk, you’ll be treated like one.

This team structure rarely scales. I recall a discussion with a C-Suite member who criticized the data team as “Getting me answers so late that I’ve forgotten my question by the time they reply.” Ouch.

In contrast, effective BI teams operate with a product-like mindset that focuses on scale and solutions. They partner closely with stakeholders to solve problems and prioritize ruthlessly based on business impact.

Data foundations not yet ready for self-serve

There is an immense amount of work required to get data ready for self-serve. Just cleaning up a few raw data tables isn’t enough. Each part of the business (Sales, Marketing, Product, etc.) need different sets of data to answer their unique use cases.

Prepping data into the right shape requires close partnership and collaboration between the data team and its internal stakeholders. This requires steps like data ingestion and transformation, implementing tooling like Git and dbt and having a team that can support the data lifecycle of a company.

Lack of data literacy

Data literacy, much like self-serve, is a tough term to nail down. This deserves its own discussion entirely, but for now let’s go with a typically squishy definition along the lines of “How well stakeholders can interact with and understand data.”

Training for data literacy is immensely difficult. Even if you have the world’s best data models and data marts and the Modern Data Stack™️ stakeholders will struggle to find value if they can’t grok the internal business data model or fall into common data pitfalls. This hurdle must be overcome no matter how well you do everything else.

Lack of data tools that enable self-serve

Typically there are two primary ways that an individual can self-serve data: SQL on a database or look at a data-team-created dashboard. SQL is great for technical individuals but is not an option for the majority of employees. Dashboards are usually widely available but lack customization. Generally dashboards are a “you get what you get” type of experience, with little to no drill down capability and slow turnaround times from data teams for enhancement requests.

Data teams must provide other options for non-SQL savvy users to explore data in a more ad-hoc sense, leveraging curated/enriched tables created for their department. This can look as simple as providing access to enriched data in Excel or “Reverse ETL” where you send data back to source systems like Salesforce for direct consumption in those contexts.

Data privacy

This varies company to company, but data privacy comes into play depending on industry and company size. And whenever privacy is a factor, data access becomes more difficult. Typical lines in the sand are material non-public information (MNPI) when a company is publicly traded or personally identifiable information (PII) that only specific people should have access to.

Data self-serve almost always runs into data privacy concerns and the height of this hurdle (or wall…) will depend on the company.

But…What Does Good Self-Serve Look Like?!

Even though I described at length that good self-serve usually looks different, there are still some guiding principles to shoot for. These may not be universal and may change depending on your company’s data maturity, but they should be helpful.

(1) Focus initial efforts on specific departments/teams

One common bugaboo is an attempt to boil the ocean. The data team is already spread thin and pivoting everyone to self-serve for all departments at the same time will be too much. Instead, focus on a couple teams or departments with clear self-serve needs. Assign a specific analyst or two for the project who already understands a particular business domain and want to take on the challenge.

This both narrows the scope and increases the likelihood of close partnership with those teams. With partnership will come alignment on business value and understanding of pain points. Everyone wins.

(2) Create roadmap in partnership with stakeholders

Self-serve must necessarily look different from department to department. The needs for Finance are wholly different from Product, Engineering, Field, Legal or Marketing. This is why a focus on self-serve and a dedicated BI partner is so crucial. Requirement gathering and roadmap creation must be done in close collaboration between BI and each department. Examples of requirements to gather include:

Examples of requirements to gather:

* Use cases
* Defining personas (technical/non-technical/etc)
* Tools needed
* Datasets
* Training/Enablement

(3) Build source of truth data marts

A data mart is a set of tables designed for ease of use by a department for their self-serve needs. These tables are specifically curated by the data team to make data easy to consume and understand for a particular department.

Just providing individuals with access to the entire database is inevitably overwhelming. There could be hundreds of billions of data points across thousands of columns and hundreds to thousands of tables. Many analysts need a year to become comfortable with data at its most granular state. Expecting non-analysts to just hop in and find value isn’t reasonable.

To avoid this steep learning curve, a curated data mart enables self-service without overwhelming stakeholders. This curated data mart must be built in close collaboration between the data partner and their stakeholders to find the sweet spot of “plenty of data” and “not confusing”.

Example: The sales team needs a few good tables such as Account, Opportunity and Task from which they can build most any report they need.

(4) Create an adoption and discoverability program

Data discoverability is an enormous challenge that must be tackled on several angles. The existence of data marts alone is not enough to drive adoption if individuals do not know how they exist or do not know how to leverage them.

To drive adoption, efforts must include:

* Training / Onboarding sessions for all stakeholders
* Clear documentation for all data marts, tools available, key reports
* Weekly office hours
* Monthly & quarterly prioritization meetings
* Deprecation process to clean out old/unused data products

Parting Notes

There’s much more to write and I’ll follow up around defining internal stakeholder personas and choosing technologies that solve different aspects of data self-serve. For now, I hope the key message you took away from this is:

“I’m empowered to figure out how to best do self-serve at my organization.”

There’s an art to this task, and that’s why it’s so difficult to find anyone giving a blueprint. There really isn’t one. And you’ll never “arrive” at the conclusion of this project. You’ll just continually improve it, much like you do all your other data efforts. The fun is in the journey.

Connect Snowflake to Excel in Minutes

Data “Self-Serve” is a buzzword that’s managed to stick around for a long time without a solution. However, I’m convinced that we can get partway there with simple data products rooted in familiar tools. One ubiquitious tool? Excel. Nearly everyone uses spreadsheets or similar productivity tools at work.

That leads me to meet stakeholders where they’re at: in Excel. And modern data warehouses like Snowflake make it really easy to do so. It’s an easy win if you’ve invested in Analytics Engineering to create clean datasets in your database. Let’s bring those datasets to your users.

Here’s how to connect Snowflake into Excel and enable live connections pivot tables in minutes. These are instructions for Windows specifically.

Step by Step Instructions

(1) Install the ODBC Driver

Click on the “Help” button in the Snowflake UI, go to “Download…” and select “ODBC Driver” and “Snowflake Repository”. Install from the file that downloads.

(2) Configure ODBC Driver

Go to your start menu and type in “ODBC” and click on ODBC Data Sources (64 bit)

Under User DSN, select Add…

Select SnowflakeDSIIDriver from the menu

Fill in the boxes as follows – though your individual situation may vary. My example uses SSO when an organization doesn’t allow direct usernames/passwords for Snowflake. Lots of options here and Snowflake has full documentation of options here.

Note: I found that lots of databases & schemas are available even after choosing some here. Not sure the full limitations, so you can play with options. I put all options in for the primary database I cared about and it worked fine.

Click on Test… to confirm it worked. Here’s the dialog if it did:

(3) Connect to database in Excel

Open Excel and go to the Data tab, click on Get Data and choose From Other Sources and pick From ODBC

From the window that pops up, pick the Snowflake connection and select OK

If successful, you’ll see a window with a dropdown showing your available databases. Use that dropdown to pick the database you want.

IMPORTANT: There is an easy way to load data directly into a Pivot Table at this point (thanks Jacob for this tip!) which will save you and teams time.

Once you select the database / schema / table you want, go to that “Load” button on the bottom and click the little down arrow next to it. Choose “Load to…”

The next menu that pops up will give you various options – pick the second one down saying PivotTable Report

DONE. You’re there. The data is now connected live to Snowflake and is available to pivot. I used Snowflake’s sample “Weather” table which I just learned has basically nothing in it, but that’s besides the point.

Parting notes

There are a couple interesting tidbits to pass both to your stakeholders as well as anyone concerned about Snowflake compute cost & data security.

(1) Stakeholders can refresh data live from Snowflake any time. By right-clicking the pivot table and selecting “Refresh”. No more stakeholders asking you for the latest data – they can just get it anytime.

(2) Data is cached on the local machine, reducing compute costs & keeping things snappy for stakeholders. This satisfies worries from both stakeholders on performance (it’s REALLY snappy, even for huge tables) as well as those concerned on cost (compute only happens on refresh).

That’s it! Just a few installations and clicks and you’ve connected Snowflake live into Excel for any stakeholder. Happy self-serving.

Start Simple With Your Analytics Project

Start Simple & Iterate

Up to this point, I’ve largely written for those looking to break into an analytics career. Today I’ll go beyond that and discuss the most powerful lesson I and many others learned — something I wish I fully understood starting out:

Start your analytics project as simple as possible and iterate from there.

This strategy borrows a lot from Agile software development not because I’m a student of it, but because I learned the values of Agile through trial and error. Only after I stumbled upon this strategy did I learn how closely it aligns to the Agile methodology.

The Common Mistake

I’m going to assume you’ve already solved the toughest issue in analytics: identifying an ambiguous problem. Congrats! Now you need to figure out how to make it happen. This is where things can go wrong.

Many analysts (myself included!) are then tempted to:

Retreat to your office
Gather & clean all the data you think everyone needs
Build the World’s Best V1 Dashboard
Schedule a meeting to present the dashboard
Receive unanimous praise for how amazing it is
Watch as everyone uses your dashboard daily

What really happens:

Retreat to your office
Gather & clean only some of the data people need
Spend way too long building the Dashboard No One Really Wanted
Stakeholders email you intermittently asking if you’re making progress
Schedule a meeting to present the dashboard
Entire meeting spent fielding questions like “Why don’t I see X or Y?”
Get the cold sweats realizing you don’t have what they need
Stakeholders frustrated that so much dev time was wasted
You’re frustrated that they are “changing what they need”
Retreat to your office

Why Does This Happen?

Every data analyst/scientist makes this mistake. It will continually happen throughout your career, even after you think you’ll never make that mistake again. No one is immune.

There is one core reason why this happens: You assume you understand what the stakeholder wants.

Except you likely don’t. Especially when you’re early in your career. You’ll think you’re on the same page with your stakeholder, but you aren’t. You think you know what data points the stakeholder needs, but you don’t (hint: the stakeholder likely doesn’t know either!). You think you know what kind of visuals the stakeholder will find most useful, but you don’t.

In fact, it’s so difficult to get everything right the first time, you should assume you don’t fully understand the request. That one time you actually do build “The World’s Best V1 Dashboard”, celebrate the unexpected success – it won’t happen often.

Strategy: Start Simple

There’s a solution to this problem: Start your analytics projects as simple as possible. This results in less wasted time in development and happier stakeholders at the end. The process looks like this:

Agree with stakeholder on an MVP (Minimum Viable Product) – something small that can be done quickly
- Your stakeholder may not know exactly what they want, so you may have lots of freedom here
Gather & clean only the data you need for the MVP
Create MVP dashboard
- Ask your stakeholder questions here, too! You don’t need to go radio silent and many times they’ll appreciate the feedback loop
Present MVP dashboard to stakeholder
Gather feedback from stakeholder
Start process over again

This process is designed to be quick, with small iterations should building on each other until everyone agrees the dashboard fits the needs of the business. The more interactions with stakeholders the better – you’ll quickly identify misalignments, missing data, new requirements, changing business needs and more.

The advantages should be clear. Stakeholders will feel ownership over a product they helped develop (leading to better adoption!). The end product will be closer to what the business needs (leading to better adoption!). And stakeholders will remember the success of the project and give you a call for the next one.

Conclusion

Don’t try to build Rome in a day on any analytics project. You’ll rarely succeed. Instead, iterate and build on a project until it becomes something useful – and likely looks nothing like what you thought it would starting out.

Analytics is a dynamic field. Don’t fight upstream with how quickly things change; set up your work process to allow for quick changes. Your company & future self will thank you.

Don’t Get an Analytics Degree

OK OK, I’ll admit it. I’m on a contrarian streak. For good reason – I want to help you with your analytics career and there are common potholes such as overrated technical ability. Analytics degrees are a close second and worth an in-depth discussion.

When I mention “degree” I mean any of the following:

Bachelors/Masters in Analytics
Analytics Boot Camps
Technical Certifications

*There are a few exceptions to this advice, though they are very case-by-case. There may be a specific position you want at your company that requires a degree to get in or you may have a personal to accomplish. I’m not speaking into those situations but still want to acknowledge they exist.

The Allure of Education

It’s logical why many have a thought process like this:

I am interested in analytics
I do not have analytics experience
Hiring managers want to see experience and/or education
Education is the next best option
I will fill in gaps in my resume with education

At face value, this makes complete sense. In other career tracks, education teaches crucial skills and gives you an entry into that industry. Want to get into law? Get a law degree. Want to become a doctor? Get a medical degree.

This is absolutely not the case in analytics. A Masters Degree, Analytics Boot Camp or MSSQL Certification will not give you a leg up for analytics positions. I see post after post after post on data science forums discussing analytics education. A key assumption is rarely called out: “Education will help you get an analytics job.”

Why Classes Struggle to Teach Analytics Skills

I had the privilege representing BI/Analytics on a panel for the University of Washington Information School. I centered on one basic point: it is near impossible for a classroom setting to prepare you for the reality of an analytics career.

Think of it this way: in college, the “game” is well-known. The teacher gives you specific concepts. Your job is to apply those concepts on your homework, tests and/or projects. The requirements are clear and tie back to the class syllabus. Data is typically clean or requires trivial amounts of cleaning to get ready.

Analytics careers are nothing like that. I wrote about how ambiguous data problems are. There’s no syllabus. Clear questions are rare. Even if questions are clear, your stakeholder often asks the wrong question. Data may not exist and any existing data is a mess. The world is ambiguous and cloudy and hard to navigate.

Imagine a college class that tried to replicate this. No syllabus. Little to no data provided. You may or may not have a test, and that test may require you to answer questions not even on the test. Even if there were questions, they may not be the ones the teacher wants you to answer. What a mess of a class!

I’m not sure how to structure a college course to capture the ambiguity in the every day life of an analyst. As Jacob wrote, there are four key soft skills for analysts and I’d be interested to hear of any creative strategies from teachers/professors to teach them. Certainly some get closer than others, but no matter what there is no replacement for the real world.

Why Degrees Don’t Matter

You may have already connected the dots. If courses can’t teach key analytics skills, then various degrees will not make a resume stand out. It’s rare for technical ability to stand out as the reason to hire someone.

Combined with the time & expensive involved with degrees, their value diminishes. Put another way, if you can get better experience AND get paid for it, consider that option first.

In Conclusion – What Now?

Experience is king, period. You may be asking “But how do I get experience without getting my first job?” Great question! This is what I referred to as the ‘Great Filter’ on landing your first analytics job. That post will cover most of what you should do instead of getting a degree.

A note from Jacob: For more on this – lots of good discussion on data twitter & in the Locally Optimistic slack. A snippet of a thread just yesterday is below.

Was just talking to someone looking for tips on preparing for data science interviews and realized I couldn’t give them any concrete answers (“should I study stats? programing? analysis? which models?”) since every single interview is radically different. Unless you’re preparing for a FAANG-style interview where they literally give you a packet of possible questions and guidance, I have no idea how any of us know what to study and get jobs in this industry. I was reminded of @tdhopper‘s great post on this topic. https://tdhopper.com/blog/some-reflections-on-being-turned-down-for-a-lot-of-data-science-jobs

Originally tweeted by Vicki Boykis (@vboykis) on November 9, 2020.

Three SQL Skills To Pass Technical Assessments

Previous Related Posts:

(1) The Many Wandering Paths to Analytics
(2) Landing Your First Analytics Job
(3) Defining Analytics Titles
(4) Technical Ability is Overrated
(5) 4 Soft Skills to Amplify your Analytics Career
(6) Case Study: Solving an Ambiguous Problem

Technical Ability Is Overrated
Me, two weeks ago

This is meant as a companion post and reply to the most common response to Technical Ability Is Overrated. Specifically, “Analysts need technical ability to do their job, and that means it’s important.” I wholeheartedly agree – and while you can’t win a job on technical ability alone, you certainly may lose it.

That means I would be remiss to cover the basic SQL concepts which will put you in a good place in most any Analyst interview. You need to know SQL to apply your business acumen and soft skills and that’s why it’s consistently tested in interviews. Generally if you know the SQL basics, the hiring team will be confident you can refresh/learn any knowledge gaps later on.

However — if you find positions that would disqualify you if you didn’t know something outside of these concepts, that should be a red flag. Those organizations focus too heavily on technical ability and/or the position is more in line with a Data Engineer than a Data Analyst.

Note: I am writing this using Snowflake SQL syntax; there are variations and quirks to each SQL version so some of this may be close but not exact to the environment you are in.

Where to brush up on SQL skills

There are tons of great SQL learning resources online now. One of the best out there is SQLZoo, with great examples and the chance to practice writing SQL to check your syntax. W3Schools also has a great set of tutorials for all sorts of SQL queries.

(0) Demonstrate Previous SQL Work

OK OK, this isn’t one of the four. But if you can demonstrate non-classroom SQL ability either through previous work or on the potential job’s take-home exercise, that’s worth its weight in gold. Mostly people want to know that you can use SQL to solve problems and if you can speak to using complex SQL to get stuff done previously, that goes a long way in checking the box

(1) Left/Right/Inner/Outer Join

These joins are the bread and butter of the SQL world – especially Left and Inner. You need to be able to explain the difference between each quickly and succinctly, as well as pick out which to use if/when tested. I’ll add a quick visual & code example of each borrowed from W3Schools. Read the in-depth explanations at W3Schools or SQLZoo for more details.

Left Join

Returns all records from the left table and the matched records from the right

SELECT Customers.CustomerName, Orders.OrderID
FROM Customers
LEFT JOIN Orders ON Customers.CustomerID = Orders.CustomerID
ORDER BY Customers.CustomerName

Right Join

Returns all records from the right table and the matched records from the left
This does the same thing as a left join and 99% of the time people use Left.

SELECT Orders.OrderID, Employees.LastName, Employees.FirstName
FROM Orders
RIGHT JOIN Employees ON Orders.EmployeeID = Employees.EmployeeID
ORDER BY Orders.OrderID

Inner Join

Selects records that have matching values in both tables

SELECT Orders.OrderID, Customers.CustomerName
FROM Orders
INNER JOIN Customers ON Orders.CustomerID = Customers.CustomerID

Outer Join

Returns all records when there is a match in the left OR right tables

SELECT Customers.CustomerName, Orders.OrderID
FROM Customers
FULL OUTER JOIN Orders ON Customers.CustomerID=Orders.CustomerID
ORDER BY Customers.CustomerName

(2) Aggregate Functions

Basic SQL functions involve returning a set of rows that match some criteria (“Show me all transactions from yesterday”). Sometimes, you’ll need to aggregate your data to answer summary questions (“Show me how many transactions we’ve had by day this year”). This is true for situations where you need to count, sum, average or find min/max.

Key Concept 1: Understand query level of detail

If you need to sum up every transaction from a single store, that’s easy – the level of detail is at the day + store level. However, if you want to see how many of each product was sold at each store on each day, suddenly you have three levels of detail (product + day + store). The more details someone needs in the data, the less aggregated it’ll get. Seems obvious but undoubtedly you’ll run into funky results when you THINK you understood the aggregation level but did not understand in reality.

Key Concept 2: Leverage GROUP BY

You need to tell SQL the level of detail in order for it to know how to aggregate your data. It won’t read your mind and if you leave any ambiguity, it’ll either fail to process (which is fine – at least you know) OR it will aggregate a ‘wrong’ answer (which is bad, you might not catch this!)

For example, let’s say you want that store data. If you don’t use Group By you’ll type it out something like:

//Example - this will fail

SELECT order_date, count(transaction_id)
FROM transactions

However that’ll return an error – SQL doesn’t see a GROUP BY statement and will tell you as much. It doesn’t know what to do with “order_date” and that ambiguity makes the query fail. That means you need to add in one more line to make it execute:

//Example - Show count of transactions by date

SELECT order_date, count(transaction_id)
FROM transactions
GROUP BY order_date

Key Concept 3: Filter aggregations with HAVING

Let’s say someone wants to only see days where there were at least 100 transactions. If you aren’t familiar with aggregations, you might write something like this:

//Example - this will fail due to using WHERE instead of HAVING

SELECT order_date, count(transaction_id)
FROM transactions
WHERE count(transaction_id) >= 100
GROUP BY order_date

SQL will throw you an error. The WHERE clause is to filter out individual rows – except the count() function looks at multiple rows at once. It won’t know what to do! Someone doing a SQL test will see if you pick up on this when whiteboarding a problem – this is a common ‘gotcha’ question.

SQL provides the HAVING clause to allow you to filter on an aggregated column, like so:

//Example - Dates with at least 100 transactions

SELECT order_date, count(transaction_id)
 as "TRANSACTIONS"
FROM transactions
GROUP BY order_date
HAVING count(transaction_id) >= 100

Notice that HAVING comes after GROUP BY – while not absolutely critical to remember the order of these, it’s a ‘nice to have’ if you can on the fly remember the order in which SQL clauses execute.

(3) Subqueries

This is typically the upper limit of SQL testing for Data Analyst jobs. Commonly you will need to use a subquery to pull in data into the SELECT, FROM or WHERE clauses. SQL is a very flexible language and you can use a subquery to define a secondary table with its own select/from/where logic that is separate from your main query.

For instance, let’s say that we want to adjust our query from part (2) and now we want an additional filter – a list of all days with at least 100 transactions AND at least one customer was from California. It may seem simple at first, but it turns out this requires a subquery.

Let’s say you initially try just adding

//Example - Initial attempt, adding in Customer table and...
//...adding in a filter for customers from California

SELECT 
    order_date, 
    count(transaction_id) as "TRANSACTIONS"
 
FROM transactions t
LEFT JOIN customer c on c.customer_id = t.customer_id
WHERE c.customer_state = 'CA'
GROUP BY order_date
HAVING count(transaction_id) >= 100

The above will return a result but it’ll be wrong. When you put the California filter in, the SQL script filters down to only rows with California customers. All other sales are removed. This means your numbers come out very low. In fact, now your query is returning a list of all days with at least 100 transactions solely from customers in California.

So, how do you use Calfornia as a filter in the aggregate without having it be part of your base query? This is where a subquery comes in. Let’s rewrite this as a sub-query in the LEFT JOIN statement and THEN use it in a where statement.

//Example - Add Subquery into WHERE clause

SELECT 
    order_date, 
    count(transaction_id) as "TRANSACTIONS"
 
FROM transactions t

//Add in Subquery into WHERE clause

WHERE t.order_date IN

    (SELECT distinct(t2.order_date)
     FROM transactions t2
     INNER JOIN customer c on
         c.customer_id = t2.customer_id AND
         c.customer_state = 'CA'
    )
    
GROUP BY order_date
HAVING count(transaction_id) >= 100

There are two crucial pieces here:
(1) I created a subquery finding days with sales to California
(2) I put that subquery into the WHERE clause to filter to those days

This is the flexibility of subqueries – I got to use a different level of detail to filter my base query. This is one of the most complex concepts you may be tested on. I considered subqueries to be right at the edge of “Expected” and “Nice to Have” and that line can be blurry elsewhere – so preparation is key here.

Bonus: CTEs

Common Table Expressions (CTEs) are becoming more and more popular. In fact, so popular that I’ve shunted the more complex subqueries in favor for CTEs. They essentially allow you to make something that acts and feels like a table, but only exists for as long as your query lasts. It is extra readable (think of how complex some subqueries can get in a long set of code!), and that readability is what makes it so powerful.

While I won’t cover the comparison fully today, I’ll save this discussion for a future blog.

In Conclusion

If you come to an interview with examples of previous SQL work and/or a knowledge of Joins/Aggregate Functions/Subqueries, you’ll do fine on the technical assessment for most roles. This is the basic toolset needed for analysts to get the job done and allows you to leverage your business acumen and soft skills.

Duelers’s Note: Jacob here. I’ve found it incredibly useful to keep a book around for reference purposes that I can dog-ear, highlight, and otherwise markup. Since I’m primarily in the MS stack, I heartily recommend “T-SQL Fundamentals” by Ben-Gan Itzik. There are great references for other SQL flavors too – but you will need to do your own research to find them.

Case Study: Solving an Ambiguous Problem

Previous Related Posts

(1) The Many Wandering Paths to Analytics
(2) Landing Your First Analytics Job
(3) Defining Analytics Titles
(4) Technical Ability is Overrated
(5) 4 Soft Skills to Amplify your Analytics Career

Identifying an Ambiguous Problem

The past two Data Duel blogs deemphasized technical ability and touted soft skills as crucial for an analytics career. My goal is to bring the discussion out of the theoretical and into a practical example from my own career where I applied those four soft skills in an analytics context. In fact, the lessons I learned in the following example are ones I regularly utilize 6+ years later.

To define terms, “Ambiguous Problem” is one which no one clearly defines and for which no one provides a clear solution.

Let’s go back to 2014. I’m working at a manufacturing/distribution company. Growth is starting to explode, and I’m working as the org’s only data analyst. Reporting into the SVP of Sales, my desk is on the sales floor in the middle of ringing phones and reps busy entering orders into our system.

After a few months I notice something interesting. Nearly every rep has an Excel spreadsheet called the “Sales Catalog” up when they’re on the phone, showing various items for sale. Sometimes the row says “In Stock” and sometimes in red it says “Out of Stock”. I also heard grumbling – the sheet isn’t right. They’ll tell a customer “Yes that item is in stock” but when they go to order it, the system denies the request. Yikes, not a great experience for the customer or rep.

The process to correct data errors was also bumpy. Excel only allows one person to update a shared network file at once, and that person is the SVP’s Executive Assistant. Reps would call or ping the EA, telling them what update to make to the Excel file. Then all the reps had to close & re-load Excel to get the up-to-date information.

As I noticed these issues stacking up, I heavily leveraged empathy and curiosity to understand what the reps wanted to accomplish and why we ended up in this rather inefficient place. I talked to multiple people across the organization – sales reps, sales managers, systems – to make sure I had a grasp of everything.

These conversations let me take an ambiguous problem and define it:

Reps can’t get accurate and timely in-stock data to their customers.

Crucially, no one told me about this data problem or how to solve it. It was up to me to define and solve it

Developing a Solution

With the problem defined, it was time to work on a solution. This is where organization became crucial. As you may have noticed, there isn’t a single solution to this problem. Instead, I needed to break it down into sub-tasks:

(1) Figure out where the true ‘In Stock’ data in the system is stored
(2) Create SQL script to retrieve that data
(3) Get that data into Excel for the sales floor (It’s OK to keep something in a format familiar to them, even if it’s not fully optimal)
(4) Make the report better! (Wouldn’t it be cool if instead of just saying ‘out of stock’ it said when it would be back in stock?)
(5) Discuss V1 with leadership & iterate as needed before launch
(6) Launch new tool with training/documentation

Each of these steps was non-trivial. I had to dive into our database and really understand how the items moved into and out of stock. I had to figure out how to write an accurate SQL script to replicate those movements. I had to figure out how to connect SQL tables into Excel and create a reliable pipeline. All while making sure I kept a similar form-factor for the sales floor to maximize adoption.

In the midst of completing each step, I made sure to understand the accuracy needed. “Good Enough” data wasn’t necessarily clear. For instance, I added in some buffer to what ‘In Stock’ meant due to how fast-moving the data was. Items went into and out of stock quickly. I wanted to minimize scenarios where my document said ‘In Stock’ and yet the system didn’t let the customer put in an order. Additionally, I needed to hedge on when an item would be ‘Back In Stock’ — more on this in a later post!

Critically, I also went through development cycle with leadership and other trusted Sales team members to make sure what I made would match their needs. They would see an early draft, give feedback and I would fix that before starting the cycle again. This is again where empathy came into play – I needed to understand their problem and make sure what I created actually solved it, rather than assuming.

In Conclusion

Analysts provide massive value by identifying and solving ambiguous data problems. I learned that early on with this Sales Catalog example. I liberally applied each of the four soft skills to go from problem identification to problem solution:

Curiosity: Dug into what the reps were trying to do and what problems their existing solution created

Accuracy: Determined the tolerance of “Good Enough” data, both due to database limitations and hedging where I would prefer inaccuracy to rest

Organization: Broke down the problem into sub problems, which built into my final solution

Empathy: From start to finish, I made sure to listen to many voices across the team – both in understanding the problem and making sure my solution actually made their work lives easier

This same cycle has served me well time and time again in an analytics career. If you can proactively discover analytics problems and solve the important ones, you’ll quickly provide value to any company lucky enough to have you aboard.

Technical Ability is Overrated

Previous Entries in Everything Analytics

(1) The Many Wandering Paths to Analytics
(2) Landing Your First Analytics Job
(3) Defining Analytics Titles

Tunnel Vision on Technical Ability

If you were to ask someone “What skills are the hallmark of a data analyst?” the answers consistently center around technical ability: SQL, Python, R, Tableau, Power BI. The same shows up on most job postings – technical ability listed first.

That means it’s unsurprising when aspirational analysts focus heavily on “What technical skills / certifications do I need to be competitive for an open position?” To hammer the point home, I took a look at the Weekly Entering & Transitioning post at the Data Science subreddit. While not Data Analyst specific, there is a ton of overlap between people interested in Data Science and Data Analytics. Here are some excerpts:

“How are entry level prospects for someone with a bachelors in data science?“

“The main concern is that I don’t have any basic knowledge in any C language.“

“I have been teaching myself SQL/Python/HTML through CodeCademy pro“

This repeats week after week after week – never ending inquiries about the technical side of the job. In the words of Morpheus – what if I were to tell you…that technical ability will not win you an analytics job? This has held true both for me getting into analytics jobs, as well as interviewing many others for analytics positions.

There’s far more to a well-rounded Data Analyst, as someone in that same Reddit thread rightly identified: “While it’s easy to find resources to learn technical/mathematical skills, which I have been doing. Are there any resources for practising problem solving in the context of data analysis“

The Two Axes of an Analyst

Below is a quadrant depiction of how analysts are assessed in interviews and in their day-to-day. “Technical Ability” isn’t listed here.

Don’t get me wrong – technical ability is absolutely important. If you have no technical ability you’ll struggle to get the data you need to do your job.

But technical ability is just a means to an end. And it’s the most teachable type of skill out there! Even if there’s a gap, it’s easy to overcome with training. Business Acumen and Soft Skills are much more difficult to uplevel. I learned this lesson firsthand:

The Smartsheet Director of BI interviewed me three years ago for a Senior Analyst position. At the end of a 45 minute discussion, I realized I hadn’t been asked a single technical question. Not one check for SQL, or Python, or Tableau skill. So I asked, “Why didn’t you discuss my technical ability? Are you just trusting I know my stuff?” The director sat back, chuckled, and replied, “I only need to know how you think — if you have technical gaps we can fill those quickly.”

Technical Ability as a Multiplier

So, what place does technical ability have if it isn’t what analysts are measured on? It’s a multiplier – a 21st century career rocket fuel.

There are countless business leaders who have excellent acumen and soft skills. The C-Suites and corner offices are filled with those individuals.

As a data analyst you leverage technical ability to multiply how well you apply your soft skills and business acumen. Suddenly you’ll find yourself at tables you otherwise would never have seen, discussing critical business questions with C-Level individuals. Finding patterns in data requires technical ability, and data-driven stories are phenomenally powerful when wielded with strong soft skills.

In Conclusion

There is a massive focus on technical ability when really that’s just a multiplier for the core skillsets a data analyst brings to the table. As you read in last week’s post, Data Analysts help the business make better decisions leveraging data. That involves connecting the data to business problems utilizing Business Acumen and effectively/persuasively communicating findings with Soft Skills.

Don’t just take my word for it – perhaps at this point you’re wondering “What are these soft skills and how do I develop them?”. Jacob has just the post for you – see 4 Soft Skills to Amplify Your Analytics Career.

Defining Analytics Titles

Previous entries in Everything Analytics:

The Many Wandering Paths to Analytics
Landing Your First Analytics Job

Confusing Web of Titles

The analytics space is rapidly growing & evolving, and this fast growth has led to a convoluted web of job titles which are overlapping and contradictory. I’m here to help you sort through some of those job titles and even open your eyes to different types of Analytics job you may not have considered. Data Scientists get all the press – but there are many more roles out there which may be a better fit for your interest & skills.

You will rarely find consistency from company to company. In fact, I’ll start with a couple disclaimers:

Disclaimer 1: Keep in mind that my opinion on the separation between these jobs has no bearing on how the HR department of a company defines their positions.

Disclaimer 2: This list is not exhaustive. There are lots of substructures to these roles as well as other data-adjacent or niche jobs which exist.

Keep In Mind When Applying

Make sure you absolutely understand the job description and ask many clarifying questions during interview rounds to fully understand what you’ll be doing. If you aren’t thorough in evaluating the job, you may not end up with the work you thought you’d be doing.

Example Job Titles

Data Scientist
(Related: Statistician)
Data Analyst
(Related: BI Analyst)
Data Engineer
(Related: BI Architect, BI Engineer)
Business Analyst
(Related: Technical Project Manager)
Machine Learning Engineer
(Related: Software Engineer)

Reporting Structure

While there is no one-size-fits-all structure, there are general trends:

Data Scientist/Data Analyst/Business Analyst

These roles may report to any part of the business, depending on how centralized the data organization is. The more centralized, the more likely they are on the same team. Sometimes they may be their own team entirely, rolling to the CEO independent of any other C-Suite leader. Other times they may roll up through the COO or CTO.

Other times, they may be decentralized and be scattered across the company with no specific structure.

Data Engineer / Machine Learning Engineer

Typically these fall under the CTO. Data Engineers may be under IT, or may be their own division. MLEs typically fall into Software Engineering — see below for more discussion.

Detailed Breakdowns

Data Scientist

Overview: This is the most-publicized job title out there and therefore is the broadest; it can mean many things at many places.

Data scientists are forward-looking and focus on predictive analytics. They certainly can do descriptive analytics, but their value comes from modeling/classification/etc.

Due to the emphasis on modeling, data scientists typically have advanced degrees in statistics, applied math, information science, or similar.

Example task: Predict how much stock of each item a company should order from its manufacturers in advance of the holiday season.

Data Analyst

Overview: While Data Scientists are generally forward-looking, data analysts are generally backward-looking and more entrenched in the business. Their job is to help the business understand what has happened up to this point and provide data in a clear & concise way for decision making in the future.

Typically data analysts focus heavily on making visualizations and presentations for the business and bridge the gap between the business and the data.

Data Analysts are also more jack-of-all-trades. It’s common to do a bit of data science, analytics, engineering, and PMing in a single role.

Example task: Create a flexible Tableau dashboard for leadership to track trial conversion to paid users over time

Data Engineer

Overview: Data engineers work on the databases that the other members of the analytics org use to get information to stakeholders. They are responsible for bringing data from the business into some form of data warehouse in an accurate, timely and secure fashion.

This means the typical customers of data engineers are the data analysts/scientists at the company. They also may work directly with different parts of the business as they want their own data ingested automatically into the larger data warehouse.

This role is typically more technical and code-heavy in order to move massive amounts of data around at scale. There is less interaction with the business than other parts of the analytics organization.

Example task: Mirror Salesforce data in a schema in Snowflake, updated every 5 minutes, for analysts & scientists to analyze/visualize

Business Analyst

Overview: Business Analysts are sometimes called a “project/product/program manager”, or PM. No matter the name, they are distinct from Data Analysts/Scientists in an important way. They coordinate and organize data projects across the business.

This role typically doesn’t exist early on in a data team’s existence. Usually individual analysts/scientists take this on until the burden of project managing starts outweighing time spent actually doing analysis. Eventually, the role of Business Analyst comes along.

Business Analysts are not expected to code or be as savvy on the technical side. Rather, their job is to identify problems, gather requirements, allocate resources and coordinate expectations between the data team and the business. This is no small task as many technically minded individuals are great at doing an analysis when there’s a clear question, but struggle to work with non-technical individuals across the organization.

Example task: Sales wants standardized KPI dashboards across their worldwide teams available for next quarter’s SKO

Machine Learning (ML) Engineer

Overview: This is a bonus position added in, largely since I see Machine Learning discussed commonly on Analytics forums and many of you may be wondering how it fits into a data org. The short answer: this role doesn’t fit into the data org per se.

Specifically, this role is commonly found on the Software Engineering (SWE) org and is more of a Software Engineer with an ML focus than anything else. This role is most similar to a data scientist and usually is more involved with implementing models within the production code of the company to solve whatever problem has been identified.

Example task: Predict which users on the website may want to know about Feature X, which will prompt an informational pop-up

In Conclusion

There are all sorts of roles to explore and this list is by no means exhaustive. As I mentioned at the start, the names above may be conflated with each other at any given job you apply to. Regardless, this gives you some guardrails around what sorts of roles are out there — from non-technical to technical and everything in between.