Categories
analytics analyticsengineering bi businessintelligence excel Uncategorized

Connect Snowflake to Excel in Minutes

Data “Self-Serve” is a buzzword that’s managed to stick around for a long time without a solution. However, I’m convinced that we can get partway there with simple data products rooted in familiar tools. One ubiquitious tool? Excel. Nearly everyone uses spreadsheets or similar productivity tools at work.

That leads me to meet stakeholders where they’re at: in Excel. And modern data warehouses like Snowflake make it really easy to do so. It’s an easy win if you’ve invested in Analytics Engineering to create clean datasets in your database. Let’s bring those datasets to your users.

Here’s how to connect Snowflake into Excel and enable live connections pivot tables in minutes. These are instructions for Windows specifically.

Step by Step Instructions

(1) Install the ODBC Driver

Click on the “Help” button in the Snowflake UI, go to “Download…” and select “ODBC Driver” and “Snowflake Repository”. Install from the file that downloads.

(2) Configure ODBC Driver

Go to your start menu and type in “ODBC” and click on ODBC Data Sources (64 bit)

Under User DSN, select Add…

Select SnowflakeDSIIDriver from the menu

Fill in the boxes as follows – though your individual situation may vary. My example uses SSO when an organization doesn’t allow direct usernames/passwords for Snowflake. Lots of options here and Snowflake has full documentation of options here.

Note: I found that lots of databases & schemas are available even after choosing some here. Not sure the full limitations, so you can play with options. I put all options in for the primary database I cared about and it worked fine.

Click on Test… to confirm it worked. Here’s the dialog if it did:

(3) Connect to database in Excel

Open Excel and go to the Data tab, click on Get Data and choose From Other Sources and pick From ODBC

From the window that pops up, pick the Snowflake connection and select OK

If successful, you’ll see a window with a dropdown showing your available databases. Use that dropdown to pick the database you want.

IMPORTANT: There is an easy way to load data directly into a Pivot Table at this point (thanks Jacob for this tip!) which will save you and teams time.

Once you select the database / schema / table you want, go to that “Load” button on the bottom and click the little down arrow next to it. Choose “Load to…”

The next menu that pops up will give you various options – pick the second one down saying PivotTable Report

DONE. You’re there. The data is now connected live to Snowflake and is available to pivot. I used Snowflake’s sample “Weather” table which I just learned has basically nothing in it, but that’s besides the point.

Parting notes

There are a couple interesting tidbits to pass both to your stakeholders as well as anyone concerned about Snowflake compute cost & data security.

(1) Stakeholders can refresh data live from Snowflake any time. By right-clicking the pivot table and selecting “Refresh”. No more stakeholders asking you for the latest data – they can just get it anytime.

(2) Data is cached on the local machine, reducing compute costs & keeping things snappy for stakeholders. This satisfies worries from both stakeholders on performance (it’s REALLY snappy, even for huge tables) as well as those concerned on cost (compute only happens on refresh).

That’s it! Just a few installations and clicks and you’ve connected Snowflake live into Excel for any stakeholder. Happy self-serving.

Categories
analytics excel retro

NBA Bubble Sim: A Retrospective

One thing that I really enjoy as an analyst is creating new models – and expanding them. I made a version of the Bubble sim with 1m+ scenarios, for example (that will turn into a blog post here at some point). But I rarely maintain the focus or energy to take a look at it after the fact to determine “how good was it at actual predicting the future?”1 I’m aiming to change that with this real-life example of this NBA model. So with that said, let’s dive in.

Predicting individual games

Using ELO to predict individual games should theoretically massively improve the predictive ability of the model versus, say, coin flips. However, as we will see, that was really not the case.

quality of prediction for individual games

Ultimately, we were just slightly better than coin flips. Sort of disappointing if I’m honest. I do think there is some context that ELO is particularly bad at explaining, which we can distill into the statement “ELO overstates the relative strength of teams that have clinched a playoff birth.”

I’ll dive into this at the end, as I think some faulty modeling by the NBA around this assumption lead to some crappy basketball being played.

Predicting which teams made playoffs

When I look at the 1000 scenarios in aggregate (instead of a game by game basis), a much clearer picture of the model and its effectiveness is painted.

quality of prediction for making playoffs

Looks pretty good! A damn good model. HOWEVER – given that for all intents & purposes, 15 out of 16 playoff spots were guaranteed, this really is a false narrative about the effectiveness of the model.

Reducing scope to measure uncertain outcomes

For the purpose of this analysis, I will take a look at the quality of the model as it relates to 3 teams – the New Orleans Pelicans (NOP), the Memphis Grizzlies (MEM), and the Portland Trailblazers (POR). This is because these are the 3 teams competing for the final playoff spot, so by getting better at predicting these teams, we improve the efficacy of the entire model.

predicting outcomes for POR, MEM & NOP

I can’t say these updated stats are particularly great. We are more accurate here than we were for predicting specific games, but far from some certain enough to do something like gamble on this model reliably. Even knowing what we did going into the NBA bubble, Portland, who ultimately made the playoffs, only had a 29% chance to make the playoffs.

Incorporating some modifications

One obvious observation as the bubble games continued was that “ELO overstated the relative strength of teams that have clinched a playoff birth.” With this knowledge, I started tweaking my model to accommodate this new information. Ultimately what I landed on was to reduce the ELO for teams that have already clinched by 20%. This number is totally arbitrary and based on gut feel. I also assumed the eastern conference was de-facto clinched based on the players who opted out or were injured for the Wizards.

Given the relatively poor performance of the model, I was seeking to explain the following data points:

  • The Bucks & Lakers were playing very poorly.
  • The Suns & Blazers looked unstoppable.

With the modification of the model to reduce ELO for qualified teams by 20%, the new playoff odds looked like this:

playoff odds with ELO reduction for clinched teams

Of course, simply buffing Portland’s playoff odds massively increases the accuracy of the prediction, so this might be a bit too reductionist. Furthermore, with some clever configuration of Excel to leverage the solver, the exact handicap percentage could be tweaked to maximize the odds of Portland making to playoffs.2 That being said, let’s take a look at how model quality changes with this change:

prediction quality post adjustment

This is MUCH better. Obviously, the updated model has the benefit of some hindsight here. But a small, targeted change the model was able to increase accuracy from 54.7% to 69.2%. Precision & recall increased by similar margins. I think there is something here that can be applied to future models of NBA outcomes.

Conclusion

Overall, I am satisfied with the outcomes of this process of exploring the model in the context of the metrics above. The key learning for me is that certainty of outcomes does impact the quality of play, at least in the NBA bubble. After accounting for that, we were able to increase model accuracy by more than 25%. To get more accurate, my analysis would need to be more surgical in approach.

My biggest take-away is that I will be designing future models to enable rapid analysis using the metrics here-in. I didn’t do that in this case as I didn’t account for actually doing this analysis. Having appropriate consideration for accuracy testing in the front end would have meant I could have backtested assumptions and model changes across a much broader data set. As a result, I didn’t have an easy way to test my updated assumption of the 20% ELO discount down at the game level. I’m certain that applying better science techniques could result in an even higher accuracy model.

I do find it super interesting that there was a huge miss on the New Orleans Pelicans performance vis-a-vis their ELO rating. This entire process was arguably designed to maximize the odds of the Pelicans (& Zion) to make the playoffs, and in that regard, the NBA’s experiment failed completely. Conversely, one thing that could have been anticipated based on the 20% ELO handicap is that the Phoenix Suns had around a 35% chance to get 7 or 8 wins. Given that, it probably would have made more sense for the NBA to open a mini-tournament at the bottom of the bracket for 7/8/9/10. It would have increased the quality of play and led to a more exciting finish to the end of the regular season. And I think NBA, who certainly has modelers far more sophisticated than I, should have anticipated the drop in play associated with teams who have already clinched.

footnotes

1I’m using the assessment framework found here on towardsdatascience.com, for accuracy, precision, true positive rate, sensitivity, and F1 score. You can find the definitions within that link – it’s worth the read.

2After writing this, I did some excel tweaking to allow the solver to optimize the handicap for clinched teams. It was 20.00001%. Bizarre.