Previous Related Posts:
(1) The Many Wandering Paths to Analytics
(2) Landing Your First Analytics Job
(3) Defining Analytics Titles
(4) Technical Ability is Overrated
(5) 4 Soft Skills to Amplify your Analytics Career
(6) Case Study: Solving an Ambiguous Problem
Technical Ability Is OverratedMe, two weeks ago
This is meant as a companion post and reply to the most common response to Technical Ability Is Overrated. Specifically, “Analysts need technical ability to do their job, and that means it’s important.” I wholeheartedly agree – and while you can’t win a job on technical ability alone, you certainly may lose it.
That means I would be remiss to cover the basic SQL concepts which will put you in a good place in most any Analyst interview. You need to know SQL to apply your business acumen and soft skills and that’s why it’s consistently tested in interviews. Generally if you know the SQL basics, the hiring team will be confident you can refresh/learn any knowledge gaps later on.
However — if you find positions that would disqualify you if you didn’t know something outside of these concepts, that should be a red flag. Those organizations focus too heavily on technical ability and/or the position is more in line with a Data Engineer than a Data Analyst.
Note: I am writing this using Snowflake SQL syntax; there are variations and quirks to each SQL version so some of this may be close but not exact to the environment you are in.
Where to brush up on SQL skills
There are tons of great SQL learning resources online now. One of the best out there is SQLZoo, with great examples and the chance to practice writing SQL to check your syntax. W3Schools also has a great set of tutorials for all sorts of SQL queries.
(0) Demonstrate Previous SQL Work
OK OK, this isn’t one of the four. But if you can demonstrate non-classroom SQL ability either through previous work or on the potential job’s take-home exercise, that’s worth its weight in gold. Mostly people want to know that you can use SQL to solve problems and if you can speak to using complex SQL to get stuff done previously, that goes a long way in checking the box
(1) Left/Right/Inner/Outer Join
These joins are the bread and butter of the SQL world – especially Left and Inner. You need to be able to explain the difference between each quickly and succinctly, as well as pick out which to use if/when tested. I’ll add a quick visual & code example of each borrowed from W3Schools. Read the in-depth explanations at W3Schools or SQLZoo for more details.
SELECT Customers.CustomerName, Orders.OrderID FROM Customers LEFT JOIN Orders ON Customers.CustomerID = Orders.CustomerID ORDER BY Customers.CustomerName
SELECT Orders.OrderID, Employees.LastName, Employees.FirstName FROM Orders RIGHT JOIN Employees ON Orders.EmployeeID = Employees.EmployeeID ORDER BY Orders.OrderID
SELECT Orders.OrderID, Customers.CustomerName FROM Orders INNER JOIN Customers ON Orders.CustomerID = Customers.CustomerID
SELECT Customers.CustomerName, Orders.OrderID FROM Customers FULL OUTER JOIN Orders ON Customers.CustomerID=Orders.CustomerID ORDER BY Customers.CustomerName
(2) Aggregate Functions
Basic SQL functions involve returning a set of rows that match some criteria (“Show me all transactions from yesterday”). Sometimes, you’ll need to aggregate your data to answer summary questions (“Show me how many transactions we’ve had by day this year”). This is true for situations where you need to count, sum, average or find min/max.
Key Concept 1: Understand query level of detail
If you need to sum up every transaction from a single store, that’s easy – the level of detail is at the day + store level. However, if you want to see how many of each product was sold at each store on each day, suddenly you have three levels of detail (product + day + store). The more details someone needs in the data, the less aggregated it’ll get. Seems obvious but undoubtedly you’ll run into funky results when you THINK you understood the aggregation level but did not understand in reality.
Key Concept 2: Leverage GROUP BY
You need to tell SQL the level of detail in order for it to know how to aggregate your data. It won’t read your mind and if you leave any ambiguity, it’ll either fail to process (which is fine – at least you know) OR it will aggregate a ‘wrong’ answer (which is bad, you might not catch this!)
For example, let’s say you want that store data. If you don’t use Group By you’ll type it out something like:
//Example - this will fail SELECT order_date, count(transaction_id) FROM transactions
However that’ll return an error – SQL doesn’t see a GROUP BY statement and will tell you as much. It doesn’t know what to do with “order_date” and that ambiguity makes the query fail. That means you need to add in one more line to make it execute:
//Example - Show count of transactions by date SELECT order_date, count(transaction_id) FROM transactions GROUP BY order_date
Key Concept 3: Filter aggregations with HAVING
Let’s say someone wants to only see days where there were at least 100 transactions. If you aren’t familiar with aggregations, you might write something like this:
//Example - this will fail due to using WHERE instead of HAVING SELECT order_date, count(transaction_id) FROM transactions WHERE count(transaction_id) >= 100 GROUP BY order_date
SQL will throw you an error. The WHERE clause is to filter out individual rows – except the count() function looks at multiple rows at once. It won’t know what to do! Someone doing a SQL test will see if you pick up on this when whiteboarding a problem – this is a common ‘gotcha’ question.
SQL provides the HAVING clause to allow you to filter on an aggregated column, like so:
//Example - Dates with at least 100 transactions SELECT order_date, count(transaction_id) as "TRANSACTIONS" FROM transactions GROUP BY order_date HAVING count(transaction_id) >= 100
Notice that HAVING comes after GROUP BY – while not absolutely critical to remember the order of these, it’s a ‘nice to have’ if you can on the fly remember the order in which SQL clauses execute.
This is typically the upper limit of SQL testing for Data Analyst jobs. Commonly you will need to use a subquery to pull in data into the SELECT, FROM or WHERE clauses. SQL is a very flexible language and you can use a subquery to define a secondary table with its own select/from/where logic that is separate from your main query.
For instance, let’s say that we want to adjust our query from part (2) and now we want an additional filter – a list of all days with at least 100 transactions AND at least one customer was from California. It may seem simple at first, but it turns out this requires a subquery.
Let’s say you initially try just adding
//Example - Initial attempt, adding in Customer table and... //...adding in a filter for customers from California SELECT order_date, count(transaction_id) as "TRANSACTIONS" FROM transactions t LEFT JOIN customer c on c.customer_id = t.customer_id WHERE c.customer_state = 'CA' GROUP BY order_date HAVING count(transaction_id) >= 100
The above will return a result but it’ll be wrong. When you put the California filter in, the SQL script filters down to only rows with California customers. All other sales are removed. This means your numbers come out very low. In fact, now your query is returning a list of all days with at least 100 transactions solely from customers in California.
So, how do you use Calfornia as a filter in the aggregate without having it be part of your base query? This is where a subquery comes in. Let’s rewrite this as a sub-query in the LEFT JOIN statement and THEN use it in a where statement.
//Example - Add Subquery into WHERE clause SELECT order_date, count(transaction_id) as "TRANSACTIONS" FROM transactions t //Add in Subquery into WHERE clause WHERE t.order_date IN (SELECT distinct(t2.order_date) FROM transactions t2 INNER JOIN customer c on c.customer_id = t2.customer_id AND c.customer_state = 'CA' ) GROUP BY order_date HAVING count(transaction_id) >= 100
There are two crucial pieces here:
(1) I created a subquery finding days with sales to California
(2) I put that subquery into the WHERE clause to filter to those days
This is the flexibility of subqueries – I got to use a different level of detail to filter my base query. This is one of the most complex concepts you may be tested on. I considered subqueries to be right at the edge of “Expected” and “Nice to Have” and that line can be blurry elsewhere – so preparation is key here.
Common Table Expressions (CTEs) are becoming more and more popular. In fact, so popular that I’ve shunted the more complex subqueries in favor for CTEs. They essentially allow you to make something that acts and feels like a table, but only exists for as long as your query lasts. It is extra readable (think of how complex some subqueries can get in a long set of code!), and that readability is what makes it so powerful.
While I won’t cover the comparison fully today, I’ll save this discussion for a future blog.
If you come to an interview with examples of previous SQL work and/or a knowledge of Joins/Aggregate Functions/Subqueries, you’ll do fine on the technical assessment for most roles. This is the basic toolset needed for analysts to get the job done and allows you to leverage your business acumen and soft skills.
Duelers’s Note: Jacob here. I’ve found it incredibly useful to keep a book around for reference purposes that I can dog-ear, highlight, and otherwise markup. Since I’m primarily in the MS stack, I heartily recommend “T-SQL Fundamentals” by Ben-Gan Itzik. There are great references for other SQL flavors too – but you will need to do your own research to find them.