Categories
analytics analyticsengineering businessintelligence

Running a personal SQL server for free

For some, getting into data analytics outside of an academic or work environment can be very challenging - where do you start? Which database do you use? And how do you do it for low or zero cost?

In this article, I am going to walk through setting up your VM1 & database, connecting to your new remote server using Azure Data Studio, and as a bonus, connecting it to dbt. I've also written about setting up dbt on windows on a previous post.

First, let's talk about requirements & recommendations:

  1. This tutorial is focused on Windows 10 + Linux. You will need Windows 10 Pro where you install your VM.
  2. I recommend that you set up your database on different physical machine than your dev machine. You should probably have at least 32GB of RAM.
  3. Since we are installing the database on another machine, that machine needs to be on the same network as your development machine.

Why use a VM at all? In my experience, running a database on your dev machine makes everything extremely slow. Your database will be very greedy with resources (RAM specifically) - so keeping it in a little box that you can turn on and off allows you to keep using your machine "as normal".

Step 1: Enable HyperV

Open powershell as administrator and run the following command:

Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V -All

More info can be found here: https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/quick-start/enable-hyper-v

Step 2: Create a VM in HyperV

You will need to restart your machine in order to use the HyperV features, so machine sure to do that first. The Microsoft documents to create a VM are exellent - and linked below. Make sure to select Ubuntu 20.04 when you create it.

https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/quick-start/quick-create-virtual-machine

Step 3: Install SQL Server on your VM

We will do the install of SQL Server2 in the CLI on Ubuntu, which MS has laid out again very nicely in their documentation. A couple of notes when walking through this:

  1. Make sure to select "SQL Server Express" as your edition. It limits your database size to 9GB but is otherwise relatively unencumbered by MS licensing.
  2. Write down your SA password. You will need it later when connecting.

This is quite detailed, so head over to this link and follow the instructions in detail: https://docs.microsoft.com/en-us/sql/linux/sql-server-linux-setup?view=sql-server-ver15

Step 4: Update the settings of your virtual switch

The default settings inside HyperV is for an "internal network" on your VM. This is fine if you are accessing your VM from the machine its running on, but the whole point here is that you want it to be a "remote server". Set the virtual switch to "external network" and you can then access your VM from any machine on your network.

Again, MS has great documentation on this here: https://docs.microsoft.com/en-us/windows-server/virtualization/hyper-v/get-started/create-a-virtual-switch-for-hyper-v-virtual-machines

Step 5: Install Azure Data Studio on your dev machine - and write some SQL!

On your dev machine, make sure you can ping your VM. In my case, my VM is named "jacob-virtual-machine", so the command to validate I can reach it is:

ping jacob-virtual-machine

If you can't ping your VM, you have some networking issues to sort out. While I am no expert here, you will want to make sure you can see your VM outside the host (Step 4, above) and that port 1433 is open on the host and the VM.

Once that is resolved, you can download and install Azure Data Studio3. Now, with the credentials from above and you VM name, you can connect to your remote server. Everything can be left on defaults, but the avoidance of doubt, check out my connection settings below.

SQL Server Connection Settings

Now you have it all working and you have your own nice empty database to play with!

Bonus Content: Connect dbt to SQL Server

For those of you wishing to use dbt with SQL Server, check out the dbt-sqlserver github. It has great details, but I'll summarize the key bits.

You will need to install the dbt connector:

pip install dbt-sqlserver

I also find their explanation of the profiles.yml file kind of confusing, so I've included my own below for reference:

local_sql:
  target: dev
  outputs:
    dev: 
      type: sqlserver
      driver: 'ODBC Driver 17 for SQL Server'
      server: <VM name>
      database: <database name>
      port: 1433
      schema: <schema name>
      user: <username>
      password: <password>

Footnotes

1 You can also probably do this with WSL2, and not install a Linux VM. However, I am going to be running more software on the VM later and I want to split it to another machine. You can also use docker over top of all of this, which I may cover in another post.
2 I'm choosing SQL Server for a couple reasons: I am familiar with it and the documentation and community are large. PostgreSQL also works here, which has the advantage of having a default dbt connector.
3 SSMS works here too, but Azure Data Studio has the advantage of being cross platform. If you are using dbt, you need a SQL runner anyway as the VS code options aren't great.

Categories
analytics bi data viz

Write Code Last - 4 steps to better dashboards

I gave a talk last week about "Data to Dashboard" and I wanted to share it here, too. There is a lot of discussion in the analytics space about dashboards and how to make them look good but less about how to get to that point. This is my take on the subject - I hope you enjoy it.

Toronto Data Workshop - 6/18/2021
Categories
analytics analyticsengineering everythinganalytics

Install dbt on Win10 - April 2021 (Updated October 2021)

I was getting a little frustrated with the web interface of dbt cloud, and just wanted to feel more "in control" of my environment. Which lead to this twitter thread:

Which led to lots of good recommendations (for Atom, POP SQL, vim+tmux, DeepChannel, and some others) but ultimately I settled on VS Code after a few auspicious DMs.

The process to install dbt on Win10 isn't exactly friendly for an analyst using dbt without engineering experience, so I wanted to share my journey and hopefully make yours easier too.

Step 0: Install Python

Note: These steps have been tested with Python 3.9.6 and earlier. As of October 15th, 2021, Python 3.10.x is not working!

Before you do anything else, install python. Make sure to check the box to "Add Python to PATH". If you don't, you can only run it explicitly.

low quality screenshot - but peep the box at the bottom.

If you miss this step, you have two options:
1 - re-install of python and check the "Add Python to PATH" box.
2 - manually add the PATH for python; one example linked here.

To confirm it's working, open the command line and enter 'py --version'. It should return the version of python you installed. If you get an error, most likely it is a PATH issue.

Step 1: Install VS Code + MS Build tools

First, download links: VS Code, MS Build tools.

As a quick call out, you need MS Build tools for MSVC v140 or higher, which is an optional component of the C++ build tools.

more low quality images, but check the right box.

The MS Build tool install takes a bit of time, so I would do this when you have a little bit of time (15 minutes or so), and you will need to restart your computer.

Step 2: Config your VS Code Environment

There are a few items that need to be done to get VS Code ready for dbt, so I will list them here.

  • Open the command palette (Ctrl+Shift+P), and type 'Python: Select Interpreter'. It should then bring up and allow you to select your python interpreter.
  • Add the dbt power user plug-in.
  • Take a look at this article for more suggested plug-ins from the dbt team.
  • Open your command line, and update pip to the latest version with the command 'pip install --upgrade pip'.
    • If you get an error here, you may need to run it with the '--user' flag as well, but I got that behavior inconsistently.

Step 3: Install dbt on Win10

If you've made it this far, congrats. If you are finding this article because your 'pip install dbt' isn't working, go back to the top and work through the above steps first.

Note: As of October 15th, 2021, this is paragraph is no longer required, but I've kept it for historical reasons. In your command line, run the following: 'pip install dbt --no-use-pep517 cryptography'. The dependency chain is broken somewhere and this flag fixes it. Do I know why? No. Do I care? Also, no.

Go ahead and run 'pip install dbt' in your CLI. This takes a few minutes to run (5-10 mins), but when it's done, you can check by running 'dbt --version' in the CLI. It should return the latest version (as of this update, 0.20.1).

Lastly, I recommend running 'dbt init' to set up your initial '.dbt' folder that holds your profiles.yml file to allow you to connect to your data warehouse. If you don't run it, you will just need to create that file by hand later. Since that file contains your credentials, it is best practice to put that in another place outside of your source control.

And with that, you are ready to connect to your repo and begin working on your dbt project. And lastly - share your Ws on twitter!

Categories
analytics bi businessintelligence everythinganalytics Uncategorized

Start Simple With Your Analytics Project

Start Simple & Iterate

Up to this point, I've largely written for those looking to break into an analytics career. Today I'll go beyond that and discuss the most powerful lesson I and many others learned -- something I wish I fully understood starting out:

Start your analytics project as simple as possible and iterate from there.

This strategy borrows a lot from Agile software development not because I'm a student of it, but because I learned the values of Agile through trial and error. Only after I stumbled upon this strategy did I learn how closely it aligns to the Agile methodology.

The Common Mistake

I'm going to assume you've already solved the toughest issue in analytics: identifying an ambiguous problem. Congrats! Now you need to figure out how to make it happen. This is where things can go wrong.

Many analysts (myself included!) are then tempted to:

  • Retreat to your office
  • Gather & clean all the data you think everyone needs
  • Build the World's Best V1 Dashboard
  • Schedule a meeting to present the dashboard
  • Receive unanimous praise for how amazing it is
  • Watch as everyone uses your dashboard daily

What really happens:

  • Retreat to your office
  • Gather & clean only some of the data people need
  • Spend way too long building the Dashboard No One Really Wanted
  • Stakeholders email you intermittently asking if you're making progress
  • Schedule a meeting to present the dashboard
  • Entire meeting spent fielding questions like "Why don't I see X or Y?"
  • Get the cold sweats realizing you don't have what they need
  • Stakeholders frustrated that so much dev time was wasted
  • You're frustrated that they are "changing what they need"
  • Retreat to your office

Why Does This Happen?

Every data analyst/scientist makes this mistake. It will continually happen throughout your career, even after you think you'll never make that mistake again. No one is immune.

There is one core reason why this happens: You assume you understand what the stakeholder wants.

Except you likely don't. Especially when you're early in your career. You'll think you're on the same page with your stakeholder, but you aren't. You think you know what data points the stakeholder needs, but you don't (hint: the stakeholder likely doesn't know either!). You think you know what kind of visuals the stakeholder will find most useful, but you don't.

In fact, it's so difficult to get everything right the first time, you should assume you don't fully understand the request. That one time you actually do build "The World's Best V1 Dashboard", celebrate the unexpected success - it won't happen often.

Strategy: Start Simple

There's a solution to this problem: Start your analytics projects as simple as possible. This results in less wasted time in development and happier stakeholders at the end. The process looks like this:

  • Agree with stakeholder on an MVP (Minimum Viable Product) - something small that can be done quickly
    • Your stakeholder may not know exactly what they want, so you may have lots of freedom here
  • Gather & clean only the data you need for the MVP
  • Create MVP dashboard
    • Ask your stakeholder questions here, too! You don't need to go radio silent and many times they'll appreciate the feedback loop
  • Present MVP dashboard to stakeholder
  • Gather feedback from stakeholder
  • Start process over again

This process is designed to be quick, with small iterations should building on each other until everyone agrees the dashboard fits the needs of the business. The more interactions with stakeholders the better - you'll quickly identify misalignments, missing data, new requirements, changing business needs and more.

The advantages should be clear. Stakeholders will feel ownership over a product they helped develop (leading to better adoption!). The end product will be closer to what the business needs (leading to better adoption!). And stakeholders will remember the success of the project and give you a call for the next one.

Conclusion

Don't try to build Rome in a day on any analytics project. You'll rarely succeed. Instead, iterate and build on a project until it becomes something useful - and likely looks nothing like what you thought it would starting out.

Analytics is a dynamic field. Don't fight upstream with how quickly things change; set up your work process to allow for quick changes. Your company & future self will thank you.

Categories
bi businessintelligence everythinganalytics

Medium Data: MS edition

This video is for your data that is too big for an excel spreadsheet and too small for a data warehouse. I like to refer to this as "Medium Data".

I can think of many times I needed this during my career. Typically, the "medium data" scenarios were related to snapshotting historical data weekly and showing changes in trends over time. One good trick I learned in one of my first jobs was to snapshot my CRM order book every week and save it in a CSV format. Eventually, that got too large for my meager tools, and I started aggregating, losing data, or other hacks (i.e., multiple excel files). Linking excel files together was basically enough to motivate me to learn SQL. With Azure, you can easily scale into the next size of data and keep your analytics rolling. Check the video below for a 15 min walk through.

Going from CSV to SQL in 16 minutes

I’ve just shown the basics - but there are some awesome articles out there that can go more in-depth, including some great automation.

The core tutorial in this video can be found here: https://social.technet.microsoft.com/wiki/contents/articles/52061.t-sql-bulk-insert-azure-csv-blob-into-azure-sql-database.aspx

To really amp it up with automatic import, check out this: https://marczak.io/posts/azure-loading-csv-to-sql/

Categories
analytics bi businessintelligence everythinganalytics

Don't Get an Analytics Degree

OK OK, I'll admit it. I'm on a contrarian streak. For good reason - I want to help you with your analytics career and there are common potholes such as overrated technical ability. Analytics degrees are a close second and worth an in-depth discussion.

When I mention "degree" I mean any of the following:

  • Bachelors/Masters in Analytics
  • Analytics Boot Camps
  • Technical Certifications

*There are a few exceptions to this advice, though they are very case-by-case. There may be a specific position you want at your company that requires a degree to get in or you may have a personal to accomplish. I'm not speaking into those situations but still want to acknowledge they exist.

The Allure of Education

It's logical why many have a thought process like this:

  • I am interested in analytics
  • I do not have analytics experience
  • Hiring managers want to see experience and/or education
  • Education is the next best option
  • I will fill in gaps in my resume with education

At face value, this makes complete sense. In other career tracks, education teaches crucial skills and gives you an entry into that industry. Want to get into law? Get a law degree. Want to become a doctor? Get a medical degree.

This is absolutely not the case in analytics. A Masters Degree, Analytics Boot Camp or MSSQL Certification will not give you a leg up for analytics positions. I see post after post after post on data science forums discussing analytics education. A key assumption is rarely called out: "Education will help you get an analytics job."

Why Classes Struggle to Teach Analytics Skills

I had the privilege representing BI/Analytics on a panel for the University of Washington Information School. I centered on one basic point: it is near impossible for a classroom setting to prepare you for the reality of an analytics career.

Think of it this way: in college, the "game" is well-known. The teacher gives you specific concepts. Your job is to apply those concepts on your homework, tests and/or projects. The requirements are clear and tie back to the class syllabus. Data is typically clean or requires trivial amounts of cleaning to get ready.

Analytics careers are nothing like that. I wrote about how ambiguous data problems are. There's no syllabus. Clear questions are rare. Even if questions are clear, your stakeholder often asks the wrong question. Data may not exist and any existing data is a mess. The world is ambiguous and cloudy and hard to navigate.

Imagine a college class that tried to replicate this. No syllabus. Little to no data provided. You may or may not have a test, and that test may require you to answer questions not even on the test. Even if there were questions, they may not be the ones the teacher wants you to answer. What a mess of a class!

I'm not sure how to structure a college course to capture the ambiguity in the every day life of an analyst. As Jacob wrote, there are four key soft skills for analysts and I'd be interested to hear of any creative strategies from teachers/professors to teach them. Certainly some get closer than others, but no matter what there is no replacement for the real world.

Why Degrees Don't Matter

You may have already connected the dots. If courses can't teach key analytics skills, then various degrees will not make a resume stand out. It's rare for technical ability to stand out as the reason to hire someone.

Combined with the time & expensive involved with degrees, their value diminishes. Put another way, if you can get better experience AND get paid for it, consider that option first.

In Conclusion - What Now?

Experience is king, period. You may be asking "But how do I get experience without getting my first job?" Great question! This is what I referred to as the 'Great Filter' on landing your first analytics job. That post will cover most of what you should do instead of getting a degree.

A note from Jacob: For more on this - lots of good discussion on data twitter & in the Locally Optimistic slack. A snippet of a thread just yesterday is below.

Was just talking to someone looking for tips on preparing for data science interviews and realized I couldn't give them any concrete answers ("should I study stats? programing? analysis? which models?") since every single interview is radically different. Unless you're preparing for a FAANG-style interview where they literally give you a packet of possible questions and guidance, I have no idea how any of us know what to study and get jobs in this industry. I was reminded of @tdhopper's great post on this topic. https://tdhopper.com/blog/some-reflections-on-being-turned-down-for-a-lot-of-data-science-jobs

Originally tweeted by Vicki Boykis (@vboykis) on November 9, 2020.

Categories
analytics bi businessintelligence everythinganalytics

Three SQL Skills To Pass Technical Assessments

Previous Related Posts:

(1) The Many Wandering Paths to Analytics
(2) Landing Your First Analytics Job
(3) Defining Analytics Titles
(4) Technical Ability is Overrated
(5) 4 Soft Skills to Amplify your Analytics Career
(6) Case Study: Solving an Ambiguous Problem

Technical Ability Is Overrated

Me, two weeks ago

This is meant as a companion post and reply to the most common response to Technical Ability Is Overrated. Specifically, "Analysts need technical ability to do their job, and that means it's important." I wholeheartedly agree - and while you can't win a job on technical ability alone, you certainly may lose it.

That means I would be remiss to cover the basic SQL concepts which will put you in a good place in most any Analyst interview. You need to know SQL to apply your business acumen and soft skills and that's why it's consistently tested in interviews. Generally if you know the SQL basics, the hiring team will be confident you can refresh/learn any knowledge gaps later on.

However -- if you find positions that would disqualify you if you didn't know something outside of these concepts, that should be a red flag. Those organizations focus too heavily on technical ability and/or the position is more in line with a Data Engineer than a Data Analyst.

Note: I am writing this using Snowflake SQL syntax; there are variations and quirks to each SQL version so some of this may be close but not exact to the environment you are in.

Where to brush up on SQL skills

There are tons of great SQL learning resources online now. One of the best out there is SQLZoo, with great examples and the chance to practice writing SQL to check your syntax. W3Schools also has a great set of tutorials for all sorts of SQL queries.

(0) Demonstrate Previous SQL Work

OK OK, this isn't one of the four. But if you can demonstrate non-classroom SQL ability either through previous work or on the potential job's take-home exercise, that's worth its weight in gold. Mostly people want to know that you can use SQL to solve problems and if you can speak to using complex SQL to get stuff done previously, that goes a long way in checking the box

(1) Left/Right/Inner/Outer Join

These joins are the bread and butter of the SQL world - especially Left and Inner. You need to be able to explain the difference between each quickly and succinctly, as well as pick out which to use if/when tested. I'll add a quick visual & code example of each borrowed from W3Schools. Read the in-depth explanations at W3Schools or SQLZoo for more details.

Left Join

Returns all records from the left table and the matched records from the right
SELECT Customers.CustomerName, Orders.OrderID
FROM Customers
LEFT JOIN Orders ON Customers.CustomerID = Orders.CustomerID
ORDER BY Customers.CustomerName

Right Join

Returns all records from the right table and the matched records from the left
This does the same thing as a left join and 99% of the time people use Left.
SELECT Orders.OrderID, Employees.LastName, Employees.FirstName
FROM Orders
RIGHT JOIN Employees ON Orders.EmployeeID = Employees.EmployeeID
ORDER BY Orders.OrderID

Inner Join

Selects records that have matching values in both tables
SELECT Orders.OrderID, Customers.CustomerName
FROM Orders
INNER JOIN Customers ON Orders.CustomerID = Customers.CustomerID

Outer Join

Returns all records when there is a match in the left OR right tables
SELECT Customers.CustomerName, Orders.OrderID
FROM Customers
FULL OUTER JOIN Orders ON Customers.CustomerID=Orders.CustomerID
ORDER BY Customers.CustomerName

(2) Aggregate Functions

Basic SQL functions involve returning a set of rows that match some criteria ("Show me all transactions from yesterday"). Sometimes, you'll need to aggregate your data to answer summary questions ("Show me how many transactions we've had by day this year"). This is true for situations where you need to count, sum, average or find min/max.

Key Concept 1: Understand query level of detail

If you need to sum up every transaction from a single store, that's easy - the level of detail is at the day + store level. However, if you want to see how many of each product was sold at each store on each day, suddenly you have three levels of detail (product + day + store). The more details someone needs in the data, the less aggregated it'll get. Seems obvious but undoubtedly you'll run into funky results when you THINK you understood the aggregation level but did not understand in reality.

Key Concept 2: Leverage GROUP BY

You need to tell SQL the level of detail in order for it to know how to aggregate your data. It won't read your mind and if you leave any ambiguity, it'll either fail to process (which is fine - at least you know) OR it will aggregate a 'wrong' answer (which is bad, you might not catch this!)

For example, let's say you want that store data. If you don't use Group By you'll type it out something like:

//Example - this will fail

SELECT order_date, count(transaction_id)
FROM transactions

However that'll return an error - SQL doesn't see a GROUP BY statement and will tell you as much. It doesn't know what to do with "order_date" and that ambiguity makes the query fail. That means you need to add in one more line to make it execute:

//Example - Show count of transactions by date

SELECT order_date, count(transaction_id)
FROM transactions
GROUP BY order_date

Key Concept 3: Filter aggregations with HAVING

Let's say someone wants to only see days where there were at least 100 transactions. If you aren't familiar with aggregations, you might write something like this:

//Example - this will fail due to using WHERE instead of HAVING

SELECT order_date, count(transaction_id)
FROM transactions
WHERE count(transaction_id) >= 100
GROUP BY order_date

SQL will throw you an error. The WHERE clause is to filter out individual rows - except the count() function looks at multiple rows at once. It won't know what to do! Someone doing a SQL test will see if you pick up on this when whiteboarding a problem - this is a common 'gotcha' question.

SQL provides the HAVING clause to allow you to filter on an aggregated column, like so:

//Example - Dates with at least 100 transactions

SELECT order_date, count(transaction_id)
 as "TRANSACTIONS"
FROM transactions
GROUP BY order_date
HAVING count(transaction_id) >= 100

Notice that HAVING comes after GROUP BY - while not absolutely critical to remember the order of these, it's a 'nice to have' if you can on the fly remember the order in which SQL clauses execute.

(3) Subqueries

This is typically the upper limit of SQL testing for Data Analyst jobs. Commonly you will need to use a subquery to pull in data into the SELECT, FROM or WHERE clauses. SQL is a very flexible language and you can use a subquery to define a secondary table with its own select/from/where logic that is separate from your main query.

For instance, let's say that we want to adjust our query from part (2) and now we want an additional filter - a list of all days with at least 100 transactions AND at least one customer was from California. It may seem simple at first, but it turns out this requires a subquery.

Let's say you initially try just adding

//Example - Initial attempt, adding in Customer table and...
//...adding in a filter for customers from California

SELECT 
    order_date, 
    count(transaction_id) as "TRANSACTIONS"
 
FROM transactions t
LEFT JOIN customer c on c.customer_id = t.customer_id
WHERE c.customer_state = 'CA'
GROUP BY order_date
HAVING count(transaction_id) >= 100

The above will return a result but it'll be wrong. When you put the California filter in, the SQL script filters down to only rows with California customers. All other sales are removed. This means your numbers come out very low. In fact, now your query is returning a list of all days with at least 100 transactions solely from customers in California.

So, how do you use Calfornia as a filter in the aggregate without having it be part of your base query? This is where a subquery comes in. Let's rewrite this as a sub-query in the LEFT JOIN statement and THEN use it in a where statement.

//Example - Add Subquery into WHERE clause

SELECT 
    order_date, 
    count(transaction_id) as "TRANSACTIONS"
 
FROM transactions t

//Add in Subquery into WHERE clause

WHERE t.order_date IN

    (SELECT distinct(t2.order_date)
     FROM transactions t2
     INNER JOIN customer c on
         c.customer_id = t2.customer_id AND
         c.customer_state = 'CA'
    )
    
GROUP BY order_date
HAVING count(transaction_id) >= 100

There are two crucial pieces here:
(1) I created a subquery finding days with sales to California
(2) I put that subquery into the WHERE clause to filter to those days

This is the flexibility of subqueries - I got to use a different level of detail to filter my base query. This is one of the most complex concepts you may be tested on. I considered subqueries to be right at the edge of "Expected" and "Nice to Have" and that line can be blurry elsewhere - so preparation is key here.

Bonus: CTEs

Common Table Expressions (CTEs) are becoming more and more popular. In fact, so popular that I've shunted the more complex subqueries in favor for CTEs. They essentially allow you to make something that acts and feels like a table, but only exists for as long as your query lasts. It is extra readable (think of how complex some subqueries can get in a long set of code!), and that readability is what makes it so powerful.

While I won't cover the comparison fully today, I'll save this discussion for a future blog.

In Conclusion

If you come to an interview with examples of previous SQL work and/or a knowledge of Joins/Aggregate Functions/Subqueries, you'll do fine on the technical assessment for most roles. This is the basic toolset needed for analysts to get the job done and allows you to leverage your business acumen and soft skills.

Duelers's Note: Jacob here. I've found it incredibly useful to keep a book around for reference purposes that I can dog-ear, highlight, and otherwise markup. Since I'm primarily in the MS stack, I heartily recommend "T-SQL Fundamentals" by Ben-Gan Itzik. There are great references for other SQL flavors too - but you will need to do your own research to find them.

Categories
analytics bi businessintelligence everythinganalytics

Case Study: Solving an Ambiguous Problem

Previous Related Posts

(1) The Many Wandering Paths to Analytics
(2) Landing Your First Analytics Job
(3) Defining Analytics Titles
(4) Technical Ability is Overrated
(5) 4 Soft Skills to Amplify your Analytics Career

Identifying an Ambiguous Problem

The past two Data Duel blogs deemphasized technical ability and touted soft skills as crucial for an analytics career. My goal is to bring the discussion out of the theoretical and into a practical example from my own career where I applied those four soft skills in an analytics context. In fact, the lessons I learned in the following example are ones I regularly utilize 6+ years later.

To define terms, "Ambiguous Problem" is one which no one clearly defines and for which no one provides a clear solution.

Let's go back to 2014. I'm working at a manufacturing/distribution company. Growth is starting to explode, and I'm working as the org's only data analyst. Reporting into the SVP of Sales, my desk is on the sales floor in the middle of ringing phones and reps busy entering orders into our system.

After a few months I notice something interesting. Nearly every rep has an Excel spreadsheet called the "Sales Catalog" up when they're on the phone, showing various items for sale. Sometimes the row says "In Stock" and sometimes in red it says "Out of Stock". I also heard grumbling - the sheet isn't right. They'll tell a customer "Yes that item is in stock" but when they go to order it, the system denies the request. Yikes, not a great experience for the customer or rep.

The process to correct data errors was also bumpy. Excel only allows one person to update a shared network file at once, and that person is the SVP's Executive Assistant. Reps would call or ping the EA, telling them what update to make to the Excel file. Then all the reps had to close & re-load Excel to get the up-to-date information.

As I noticed these issues stacking up, I heavily leveraged empathy and curiosity to understand what the reps wanted to accomplish and why we ended up in this rather inefficient place. I talked to multiple people across the organization - sales reps, sales managers, systems - to make sure I had a grasp of everything.

These conversations let me take an ambiguous problem and define it:

Reps can't get accurate and timely in-stock data to their customers.

Crucially, no one told me about this data problem or how to solve it. It was up to me to define and solve it

Developing a Solution

With the problem defined, it was time to work on a solution. This is where organization became crucial. As you may have noticed, there isn't a single solution to this problem. Instead, I needed to break it down into sub-tasks:

(1) Figure out where the true 'In Stock' data in the system is stored
(2) Create SQL script to retrieve that data
(3) Get that data into Excel for the sales floor (It's OK to keep something in a format familiar to them, even if it's not fully optimal)
(4) Make the report better! (Wouldn't it be cool if instead of just saying 'out of stock' it said when it would be back in stock?)
(5) Discuss V1 with leadership & iterate as needed before launch
(6) Launch new tool with training/documentation

Each of these steps was non-trivial. I had to dive into our database and really understand how the items moved into and out of stock. I had to figure out how to write an accurate SQL script to replicate those movements. I had to figure out how to connect SQL tables into Excel and create a reliable pipeline. All while making sure I kept a similar form-factor for the sales floor to maximize adoption.

In the midst of completing each step, I made sure to understand the accuracy needed. "Good Enough" data wasn't necessarily clear. For instance, I added in some buffer to what 'In Stock' meant due to how fast-moving the data was. Items went into and out of stock quickly. I wanted to minimize scenarios where my document said 'In Stock' and yet the system didn't let the customer put in an order. Additionally, I needed to hedge on when an item would be 'Back In Stock' -- more on this in a later post!

Critically, I also went through development cycle with leadership and other trusted Sales team members to make sure what I made would match their needs. They would see an early draft, give feedback and I would fix that before starting the cycle again. This is again where empathy came into play - I needed to understand their problem and make sure what I created actually solved it, rather than assuming.

In Conclusion

Analysts provide massive value by identifying and solving ambiguous data problems. I learned that early on with this Sales Catalog example. I liberally applied each of the four soft skills to go from problem identification to problem solution:

Curiosity: Dug into what the reps were trying to do and what problems their existing solution created

Accuracy: Determined the tolerance of "Good Enough" data, both due to database limitations and hedging where I would prefer inaccuracy to rest

Organization: Broke down the problem into sub problems, which built into my final solution

Empathy: From start to finish, I made sure to listen to many voices across the team - both in understanding the problem and making sure my solution actually made their work lives easier

This same cycle has served me well time and time again in an analytics career. If you can proactively discover analytics problems and solve the important ones, you'll quickly provide value to any company lucky enough to have you aboard.

Categories
analytics bi businessintelligence everythinganalytics

4 Soft Skills to Amplify your Analytics Career

Soft Skills

I think Nate really said it best with "Technical ability is overrated." When I'm looking to make a hire as a manager, there are four skills that I'm looking for when I'm interviewing and continually assessing for my reports.

  • Curiosity - a childlike ability to keep asking, "Why?"
  • Accuracy - balancing perfect vs. "close enough" for your data
  • Organization - ability to break down tasks into small chunks and reliability execute on them
  • Empathy - actively listening & seeking to understand, and communication centered on your audience

I'll breakdown each of these with characteristics with an example, an interview question testing for that skill, and a way that you can improve in each of these areas.

Curiosity

The first soft-skill to have in your repertoire is Curiosity. This often means you are always asking questions and aren't afraid of asking them. I find myself often hedging a bit in this area by saying something like, "Sorry to be dense about this, but can you explain?"

When I'm dealing with new subject areas, this often will mean pausing conversations to understand words and what they mean. "Net Sales" often means something different in the sales organization vs. the finance organization, so getting to certainty on terms is critical. Frequently, digging into these types of questions can get uncomfortable, especially if the person asking you for help doesn't know the answer or can't define it well.

In interviews, there are a couple of ways to get at this skill. One way is to probe about problem-solving: identifying and solving tough problems. This is a bit open-ended, so making sure to redirect the question to your underlying objective is advised. Another way would be to model the behavior and assess how the candidate handles it. A curious person should be able to match your energy and get excited at the premise of jumping down the rabbit hole on a specific subject.

Not everyone comes by this skill intuitively, and for those of you in that bucket, there are some great frameworks to unlock a curious mind. My favorite comes from Sakichi Toyoda, of Toyota fame, and is called the "5 Whys". More on this below, from Wikipedia.

The key is to encourage the trouble-shooter to avoid assumptions and logic traps and instead trace the chain of causality in direct increments from the effect through any layers of abstraction to a root cause that still has some connection to the original problem.

https://en.wikipedia.org/wiki/Five_whys

Accuracy

Up next, we have accuracy. As a data analyst, it is critical to be right an overwhelming majority of the time. You don't need to be perfect, and in fact, perfect is the enemy of good. This can be tricky to do well because, as an analyst, you usually are the least knowledgeable person in the domain of the problem at hand. Getting accurate goes hand-in-hand with Curiosity because you must constantly bring assumptions to light. There is a lot of digging to do.

One particularly thorny problem to deal with is sales data within a CRM. It is highly speculative. It changes frequently. In short, it is unreliable. I have found working with this type of data greatly benefits from a common snapshotting period. Just take a backup every Friday at 5 pm. Do your analysis on a static copy, and figure out how to surface changes to key data fields (like close date or opportunity size). This allows you to bring accuracy to constantly shifting data sets.

Part-and-parcel with my comment above, in interviews, I like to probe around "soft data" and see how the prospective analyst has added certainty when data is vague or unreliable. For analysts with finance experience, asking questions about how they dealt with financial periods and month-end processes since those can bottleneck key metrics (like revenue). In that same vein, asking how, in detail, key metrics were calculated will shed some light on their accuracy approach.

Part of why I favor people with accounting backgrounds coming into analytics roles is that accuracy comes with the package. This learned through long hours of grinding out Excel spreadsheets and cross footing numbers one, two, three times. Thankfully, there some tricks to getting better at Accuracy, and it comes along with Curiosity. My favorite is applying a simple checksum technique and ensuring my source & target are equal on an aggregate basis. If it's wrong - take time to dig into why and understand why what you thought was correct isn’t. The second trick is a checklist, especially for common tasks (say, deploying code to production). Taking time to document exactly how something is done not only reduces the cognitive load for the next time but improves the quality of your work.

Organization

To me, being organized is not about a tidy desk. It is about being to tackle a problem from beginning to end. It means being able to cut through ambiguity and deliver something excellent. An underrated thing about being organized is the skill of breaking down big, hairy problems into small, actionable next steps. To me, an organized person always knows what to do next.

One example where I think Organization comes into play is the meetings you take with your stakeholders. I always try to make sure to recap actions at the end, as a habit to do every meeting that I am in. If I have actions, I’ll write them down but leave others to track their own action.

In an interview, I’ll test for this skill by probing about systems of work. “How do you organize yourself?” or “How do you know what to do next?” are questions that can get to the heart of this. For an analyst, this "system of work" is at the heart of getting more interesting work. Shipping early and often is critical to getting into the more interesting bits of work, so building your own work system is critical.

Getting better at Organization can come in many forms, but looking back on my career, this didn’t come naturally to me. I recall missing a key deadline for some sales analysis for my CSMO. He asked me why I didn’t have what he needed, and I replied, “I was busy.” His response, which put the fear of God into me, was “we are all busy” as he rolled his eyes. Needless to say, I went and picked up David Allen’s Getting Things Done. If you don’t have a work system, GTD is a great starting point, and I still use bits & pieces every day.

Empathy

You want to seek to understand and always bring empathy to the conversation with your stakeholders. They are taking the time to teach you about their business and its problems, so be a respectful skeptic. Remember, you want these folks to back to you the next time they have a problem! Empathy also means centering your communication on your stakeholders, so they feel heard even when you are giving bad news.

One thing I did as I got more comfortable in my analyst role was “rounding” with key stakeholders. This would mean making space to get coffee, drop-in late afternoon as things are winding down, and occasional lunches. These conversations were often more personal than business, but by the time conversation turned to work, we were both comfortable and ready to listen to each other. At one point, I had a couch in my office, and we joked that people would come by for therapy1.

Testing for empathy in an interview is a bit of a challenge. For me, I try to observe if they are listening versus waiting for their time to speak. I can go on for a bit too long at times, so when I catch myself wandering, I also casually check for body language in the candidate to see if they are really listening. Ultimately, this is one of the hardest skills to judge in an interview for me. But I’m actively trying to find ways to measure this quickly and accurately.

As an analyst, you are pretty sharp and usually have a good idea of how to solve a problem as soon as you hear it. To be more empathetic, slow down. Fall in love with the problem. You need to see the problem clearly enough that you can come up with a solution that exceeds your stakeholder’s expectations. Repeat back what you think you heard. This is especially important if the person across the table from you is from a different background, as cultural context can get in the way of great communication. People should always leave a meeting with you feeling like they were listened to.

Final Notes

With Curiosity, Accuracy, Organization, & Empathy, you can be a great analyst. These characteristics all build on each other and help you build a reputation as a reliable, skillful person who can deliver business value. People will seek you as the analyst to solve their problems. Yes - it's great if you write some SQL, python, or R, but these soft skills will allow you to be 10x greater than someone much stronger technically. After all, I truly believe that success as an analyst should be measured by is how they enable the people around them. A great analyst doesn’t 10x themselves; they 2x (or more!) everyone around them.

—-

1This is REALLY HARD to do in a remote environment. No idea how to replicate this digitally but I’m sure there is a way...

Categories
analytics bi businessintelligence everythinganalytics

Technical Ability is Overrated

Previous Entries in Everything Analytics

(1) The Many Wandering Paths to Analytics
(2) Landing Your First Analytics Job
(3) Defining Analytics Titles

Tunnel Vision on Technical Ability

If you were to ask someone "What skills are the hallmark of a data analyst?" the answers consistently center around technical ability: SQL, Python, R, Tableau, Power BI. The same shows up on most job postings - technical ability listed first.

That means it's unsurprising when aspirational analysts focus heavily on "What technical skills / certifications do I need to be competitive for an open position?" To hammer the point home, I took a look at the Weekly Entering & Transitioning post at the Data Science subreddit. While not Data Analyst specific, there is a ton of overlap between people interested in Data Science and Data Analytics. Here are some excerpts:

"How are entry level prospects for someone with a bachelors in data science?"

"The main concern is that I don't have any basic knowledge in any C language."

"I have been teaching myself SQL/Python/HTML through CodeCademy pro"

This repeats week after week after week - never ending inquiries about the technical side of the job. In the words of Morpheus - what if I were to tell you...that technical ability will not win you an analytics job? This has held true both for me getting into analytics jobs, as well as interviewing many others for analytics positions.

There's far more to a well-rounded Data Analyst, as someone in that same Reddit thread rightly identified: "While it’s easy to find resources to learn technical/mathematical skills, which I have been doing. Are there any resources for practising problem solving in the context of data analysis"

The Two Axes of an Analyst

Below is a quadrant depiction of how analysts are assessed in interviews and in their day-to-day. "Technical Ability" isn't listed here.

Data Analyst Skillset Quadrant

Don't get me wrong - technical ability is absolutely important. If you have no technical ability you'll struggle to get the data you need to do your job.

But technical ability is just a means to an end. And it's the most teachable type of skill out there! Even if there's a gap, it's easy to overcome with training. Business Acumen and Soft Skills are much more difficult to uplevel. I learned this lesson firsthand:

The Smartsheet Director of BI interviewed me three years ago for a Senior Analyst position. At the end of a 45 minute discussion, I realized I hadn't been asked a single technical question. Not one check for SQL, or Python, or Tableau skill. So I asked, "Why didn't you discuss my technical ability? Are you just trusting I know my stuff?" The director sat back, chuckled, and replied, "I only need to know how you think -- if you have technical gaps we can fill those quickly."

Technical Ability as a Multiplier

So, what place does technical ability have if it isn't what analysts are measured on? It's a multiplier - a 21st century career rocket fuel.

There are countless business leaders who have excellent acumen and soft skills. The C-Suites and corner offices are filled with those individuals.

As a data analyst you leverage technical ability to multiply how well you apply your soft skills and business acumen. Suddenly you'll find yourself at tables you otherwise would never have seen, discussing critical business questions with C-Level individuals. Finding patterns in data requires technical ability, and data-driven stories are phenomenally powerful when wielded with strong soft skills.

In Conclusion

There is a massive focus on technical ability when really that's just a multiplier for the core skillsets a data analyst brings to the table. As you read in last week's post, Data Analysts help the business make better decisions leveraging data. That involves connecting the data to business problems utilizing Business Acumen and effectively/persuasively communicating findings with Soft Skills.

Don't just take my word for it - perhaps at this point you're wondering "What are these soft skills and how do I develop them?". Jacob has just the post for you - see 4 Soft Skills to Amplify Your Analytics Career.