data viz Archives – DataDuel.co

If spreadsheets are eternal, are BI tools transitory?

A few months ago, Benn Stancil wrote about the eternal spreadsheet. While I appreciated the generous shout-out to both mdsinabox and MotherDuck (my employer), this really got the wheels turning around something that I have been feeling but have only been able to put into words recently: are BI tools transitory?

Consider the following scenario in the Microsoft stack: ingest data with ADF, transform with Fabric (& maybe dbt?), build a semantic model in Power BI, and delicately craft an artisanal dashboard (with your mouse). Then your stakeholder takes a look at your dashboard, navigates to the top left corner and clicks “Analyze In Excel”. How did we get here?

I remember back in the 90s, hearing my dad talk about “killer apps”. The killer app was the app that made your whole platform work. If you wanted your platform to be adopted, it needed a killer app so good that users would switch their hardware and software (an expensive proposition at the time) so they could use the killer app. In my lifetime, I recall a few killer apps: The App Store (iOS), the web (Internet), and Spreadsheets (personal computing).

Spreadsheets allowed a user to make an update in one place and for the data to flow to another place. If this concept seems oddly like a directed acyclic graph (DAG), that’s because it is. These spreadsheets contain a bunch of features that we find handy in the data solutions stack today: snapshots (save as), version control (file naming), sharing (network shares & email attachments), business intelligence (pivot tables & charts), file interoperability (reading CSV, JSON, XML, etc.), and transformation (Power Query, plus an earlier, even more cursed version). All of these pieces have obvious metaphors in the commonly used data stacks today. Critically, one piece is missing: orchestration (note: back in the mid-2010s, I used and loved an Excel plugin called “Jet Reports” that included an orchestrator, among other things).
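
To make the DAG comparison concrete, here is a minimal sketch in Python. The cell names and formulas are made up for illustration, and a real spreadsheet engine is far smarter about what it recalculates; the point is only that each cell is evaluated after the cells it reads from.

```python
from graphlib import TopologicalSorter

# Hypothetical four-cell sheet: each cell lists the cells it reads from
# and a function that computes its value from those inputs.
cells = {
    "A1": ((), lambda: 100),                       # unit price (input)
    "A2": ((), lambda: 3),                         # quantity (input)
    "B1": (("A1", "A2"), lambda a1, a2: a1 * a2),  # =A1*A2
    "C1": (("B1",), lambda b1: b1 - 50),           # =B1-50 (flat discount)
}


def recalculate(cells):
    """Evaluate every cell after the cells it depends on."""
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {name: set(deps) for name, (deps, _) in cells.items()}
    values = {}
    for name in TopologicalSorter(graph).static_order():
        deps, formula = cells[name]
        values[name] = formula(*(values[d] for d in deps))
    return values


values = recalculate(cells)
print(values["B1"], values["C1"])
```

Changing an input cell and calling `recalculate` again flows the update through B1 and on to C1, which is the "update in one place" behavior described above.
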
Now if you were running a business in the 90s, there was no need for orchestration in your spreadsheet. You, the business user, were the orchestrator. Your data came from many places – memos (later, emails), research (books, later PDFs), a filing cabinet (later, databases), phone calls (later, Slack), meetings (later, Zoom calls), and your own synthesis (later, ChatGPT (just kidding)). Software could not contain these! We did not have the digital twins for these analog processes. In some ways, the spreadsheet was the perfect digital collection point for these physical artifacts.

As each of these parts of our business decision making input processes transitioned to digital, our poor spreadsheet began to fall out of favor. We replaced memos with emails, phone calls with IM (via Skype, if you are old enough to remember), and so on. And these digital processes began to produce loads of data. Every step produced an event that was stored in a database. The pace of change in the business environment increased in kind. Our once-per-month spreadsheets orchestrated by humans were a bit too slow, and processes produced too much data to be aggregated by humans. I fondly recall the launch of Excel 2007, which included a new architecture and file format, so that we could process one million rows instead of only 65,536.

Unfortunately, the hardware at the time could not actually handle one million rows. Every single person using Excel, unencumbered by the old 65,536-row limit, ran into the hard limits of the Excel architecture and inevitably saw a spinning “waiting for Excel” icon before crashing (hopefully you saved recently). Hilariously, Microsoft trained users to tolerate an absolutely terrible experience. Excel could do too much. What we needed to do was unbundle this tool, take it apart piece-by-piece, so that we could have good, delightful experiences for our business users. Users could still use spreadsheets for things, but we needed to shift the load-bearing intelligence of our business decision making into better tools.

So we built even more powerful databases, and ways to automate decision making at scale. We began to use multiple computers, running in parallel, to solve these problems for us. Large complex systems like Hadoop were required to aggregate all this data. Companies like Google harnessed the immense scale enabled by these systems to become the largest in the world, building never-before-seen products and experiences.

At the same time, CPU clock speeds stopped increasing. We had maxed the number of cycles we could push out of the silicon in the fabs. But innovation found a way to continue – we began to add more cores. Slowly but surely Moore’s law kept on holding, not on clock speed but on throughput.

The software built to take advantage of the scale that was possible with huge quantities of networked computers made assumptions about how to work at great scale across many machines (e.g. Spark). These assumptions did not generalize to single machines with many cores. This has not gone unnoticed, of course (see George’s tweet).

So what happened to our business intelligence while this was going on? The number of tools exploded, while the consumption interface remained unchanged. Categories were split into sub-categories, and those into further sub-categories. We only had so many charting primitives, and whether we dragged and dropped with Tableau or used BI as code in Evidence, the output looked largely the same. But instead of the one tool that we needed in the 90s, we now had thousands.

But I would argue we haven’t added anything new; we’ve merely unbundled it into a bunch of different products that don’t work that great together. REST APIs have allowed scalable, loosely coupled systems but really suck to work with. Behind every large enterprise data workflow is an SFTP server with CSVs sitting on it (if you are lucky, it’s object storage and a compressed format, but it’s the same thing).

If we look at the trends, in 5 years we will have approx. 10x more compute than we do today, and Backblaze estimates that cost per GB of storage will stabilize around $0.01/GB ($10/TB). If these trends hold, we will easily have enough horsepower on our laptops to put all these pieces that we have decoupled over time into one box. If BI tools are transitory, spreadsheets are eternal. The era of spreadsheets 2.0 will be upon us.
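
As a quick back-of-the-envelope check on those two numbers, the sketch below works out the annual growth rate implied by 10x in 5 years and converts the storage price to a per-TB figure. It assumes smooth compounding and 1 TB = 1,000 GB, and it is arithmetic only, not a forecast.

```python
# Annual growth rate implied by "10x more compute in 5 years",
# assuming the growth compounds evenly year over year.
growth_factor = 10
years = 5
annual_growth = growth_factor ** (1 / years) - 1

# Storage cost at the quoted $0.01 per GB (1 TB taken as 1,000 GB).
cost_per_gb = 0.01
cost_per_tb = cost_per_gb * 1000

print(f"Implied compute growth: {annual_growth:.1%} per year")
print(f"Storage: ${cost_per_tb:.2f} per TB")
```

In other words, 10x over 5 years works out to a bit under 60% more compute every year.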

What are the characteristics of Spreadsheets 2.0?

Runs on a single node with many cores (hundreds?)
One file format for handling all types of data (xlsx++)
One language for end-to-end data manipulation (sql)
A spreadsheet UI for interacting with data at any step in the data manipulation (power query-ish)
Fast, interactive charting (mosaic)
Intelligent, incremental orchestration (dynamic DAGs)
An AI trained on all these parts above to assist the user in documentation, understanding, and building (clippy but good)

I believe the first building block of this has emerged in front of our eyes: DuckDB. The hardware is being built as we speak (the feedback loop will build it whether we like it or not). Julian Hyde is advocating for “completing the SQL spec to handle metrics” (with apologies to Malloy) – humans have refined this language over the last 50 years and will continue to do it for the next 50. We already have the UI primitives (Excel), so we merely need to bolt these together.

It’s time for the humble spreadsheet to RETVRN. It’s time to bring humans back into the workflow, empowered by AI, to own the data ingestion, transformation, and synthesis required to make decisions. Of course, I’m tinkering with this idea today. If you are interested in what I have so far, please reach out; I would love to talk.

Connect Snowflake to Excel in Minutes

Data “Self-Serve” is a buzzword that’s managed to stick around for a long time without a solution. However, I’m convinced that we can get partway there with simple data products rooted in familiar tools. One ubiquitous tool? Excel. Nearly everyone uses spreadsheets or similar productivity tools at work.

That leads me to meet stakeholders where they’re at: in Excel. And modern data warehouses like Snowflake make it really easy to do so. It’s an easy win if you’ve invested in Analytics Engineering to create clean datasets in your database. Let’s bring those datasets to your users.

Here’s how to connect Snowflake to Excel and enable live-connected pivot tables in minutes. These are instructions for Windows specifically.

Step by Step Instructions

(1) Install the ODBC Driver

Click on the “Help” button in the Snowflake UI, go to “Download…” and select “ODBC Driver” and “Snowflake Repository”. Install from the file that downloads.

(2) Configure ODBC Driver

Go to your start menu and type in “ODBC” and click on ODBC Data Sources (64 bit)

Under User DSN, select Add…

Select SnowflakeDSIIDriver from the menu

Fill in the boxes as follows – though your individual situation may vary. My example uses SSO, for when an organization doesn’t allow direct usernames/passwords for Snowflake. There are lots of options here, and Snowflake’s documentation covers them in full.

Note: I found that lots of databases & schemas are available even after choosing some here. Not sure the full limitations, so you can play with options. I put all options in for the primary database I cared about and it worked fine.
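
If you would rather script the connection than click through the dialog, the same fields can be expressed as a DSN-less ODBC connection string. This is only a sketch: the account, user, and object names below are placeholders, and `externalbrowser` is the authenticator value used for browser-based SSO, so swap in whatever your organization requires.

```python
def snowflake_odbc_connection_string(account, user, database, schema, warehouse, role):
    """Build a DSN-less ODBC connection string for the Snowflake driver.

    All argument values are placeholders supplied by the caller; the keys
    mirror the boxes in the ODBC Data Source dialog.
    """
    params = {
        "Driver": "{SnowflakeDSIIDriver}",              # driver name from the ODBC menu
        "Server": f"{account}.snowflakecomputing.com",
        "UID": user,
        "Authenticator": "externalbrowser",             # browser-based SSO, no password
        "Database": database,
        "Schema": schema,
        "Warehouse": warehouse,
        "Role": role,
    }
    return ";".join(f"{key}={value}" for key, value in params.items())


# Hypothetical values for illustration only.
conn_str = snowflake_odbc_connection_string(
    account="my_account",
    user="analyst@example.com",
    database="ANALYTICS",
    schema="PUBLIC",
    warehouse="REPORTING_WH",
    role="ANALYST",
)
print(conn_str)
```

The resulting string could be handed to an ODBC client library such as pyodbc (`pyodbc.connect(conn_str)`), which is deliberately not imported here.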

Click on Test… to confirm it worked. Here’s the dialog if it did:

(3) Connect to database in Excel

Open Excel and go to the Data tab, click on Get Data and choose From Other Sources and pick From ODBC

From the window that pops up, pick the Snowflake connection and select OK

If successful, you’ll see a window with a dropdown showing your available databases. Use that dropdown to pick the database you want.

IMPORTANT: There is an easy way to load data directly into a Pivot Table at this point (thanks Jacob for this tip!) which will save you and teams time.

Once you select the database / schema / table you want, go to that “Load” button on the bottom and click the little down arrow next to it. Choose “Load to…”

The next menu that pops up will give you various options – pick the second one down saying PivotTable Report

DONE. You’re there. The data is now connected live to Snowflake and is available to pivot. I used Snowflake’s sample “Weather” table which I just learned has basically nothing in it, but that’s beside the point.

Parting notes

There are a couple interesting tidbits to pass both to your stakeholders as well as anyone concerned about Snowflake compute cost & data security.

(1) Stakeholders can refresh data live from Snowflake any time by right-clicking the pivot table and selecting “Refresh”. No more stakeholders asking you for the latest data – they can just get it anytime.

(2) Data is cached on the local machine, reducing compute costs & keeping things snappy for stakeholders. This satisfies both stakeholders worried about performance (it’s REALLY snappy, even for huge tables) and those concerned about cost (compute only happens on refresh).

That’s it! Just a few installations and clicks and you’ve connected Snowflake live into Excel for any stakeholder. Happy self-serving.

Write Code Last – 4 steps to better dashboards

I gave a talk last week about “Data to Dashboard” and I wanted to share it here, too. There is a lot of discussion in the analytics space about dashboards and how to make them look good but less about how to get to that point. This is my take on the subject – I hope you enjoy it.

Toronto Data Workshop – 6/18/2021

NBA Bubble Sim: A Retrospective

One thing that I really enjoy as an analyst is creating new models – and expanding them. I made a version of the Bubble sim with 1m+ scenarios, for example (that will turn into a blog post here at some point). But I rarely maintain the focus or energy to take a look at it after the fact to determine “how good was it at actually predicting the future?”¹ I’m aiming to change that with this real-life example of this NBA model. So with that said, let’s dive in.

Predicting individual games

Using ELO to predict individual games should theoretically massively improve the predictive ability of the model versus, say, coin flips. However, as we will see, that was really not the case.
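
For reference, this is the standard Elo expected-score formula that turns a rating gap into a win probability. It is a minimal sketch with made-up ratings; the model in this post may layer other adjustments on top.

```python
def elo_win_probability(rating_a, rating_b):
    """Standard Elo expected score: probability that team A beats team B."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


# Hypothetical ratings for illustration.
even_match = elo_win_probability(1500, 1500)   # evenly matched teams
favorite = elo_win_probability(1600, 1500)     # A is rated 100 points higher

print(f"Even match: {even_match:.0%}, 100-point favorite: {favorite:.0%}")
```

A 100-point edge is only worth about a 64% win probability, which is part of why single-game predictions are hard to separate from coin flips.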

quality of prediction for individual games

Ultimately, we were just slightly better than coin flips. Sort of disappointing if I’m honest. I do think there is some context that ELO is particularly bad at explaining, which we can distill into the statement “ELO overstates the relative strength of teams that have clinched a playoff berth.”

I’ll dive into this at the end, as I think some faulty modeling by the NBA around this assumption led to some crappy basketball being played.

Predicting which teams made playoffs

When I look at the 1000 scenarios in aggregate (instead of a game by game basis), a much clearer picture of the model and its effectiveness is painted.
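
The aggregate view can be sketched as a small Monte Carlo loop: play out the remaining schedule many times, rank the teams, and count how often each one lands in a playoff spot. Everything below (team codes, ratings, records, schedule, random tie-breaks) is hypothetical and far simpler than the actual Excel model.

```python
import random


def elo_win_probability(rating_a, rating_b):
    """Standard Elo expected score: probability that team A beats team B."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


def simulate_playoff_odds(ratings, wins, schedule, spots, n_scenarios=1000, seed=0):
    """Share of simulated scenarios in which each team finishes in a playoff spot."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    made = {team: 0 for team in ratings}
    for _ in range(n_scenarios):
        final = dict(wins)
        for team_a, team_b in schedule:
            p = elo_win_probability(ratings[team_a], ratings[team_b])
            winner = team_a if rng.random() < p else team_b
            final[winner] += 1
        # Rank by wins; ties are broken randomly (an assumption, not NBA rules).
        ranked = sorted(final, key=lambda t: (final[t], rng.random()), reverse=True)
        for team in ranked[:spots]:
            made[team] += 1
    return {team: made[team] / n_scenarios for team in ratings}


# Hypothetical teams fighting for one remaining spot.
ratings = {"AAA": 1600, "BBB": 1500, "CCC": 1450}
current_wins = {"AAA": 33, "BBB": 32, "CCC": 32}
remaining_games = [("AAA", "BBB"), ("AAA", "CCC"), ("BBB", "CCC")] * 2

odds = simulate_playoff_odds(ratings, current_wins, remaining_games, spots=1)
print(odds)
```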

quality of prediction for making playoffs

Looks pretty good! A damn good model. HOWEVER – given that for all intents & purposes, 15 out of 16 playoff spots were guaranteed, this really is a false narrative about the effectiveness of the model.

Reducing scope to measure uncertain outcomes

For the purpose of this analysis, I will take a look at the quality of the model as it relates to 3 teams – the New Orleans Pelicans (NOP), the Memphis Grizzlies (MEM), and the Portland Trail Blazers (POR). This is because these are the 3 teams competing for the final playoff spot, so by getting better at predicting these teams, we improve the efficacy of the entire model.

I can’t say these updated stats are particularly great. We are more accurate here than we were for predicting specific games, but far from certain enough to do something like gamble on this model reliably. Even knowing what we did going into the NBA bubble, Portland, who ultimately made the playoffs, only had a 29% chance to make the playoffs.

Incorporating some modifications

One obvious observation as the bubble games continued was that “ELO overstated the relative strength of teams that have clinched a playoff berth.” With this knowledge, I started tweaking my model to accommodate this new information. Ultimately what I landed on was to reduce the ELO for teams that have already clinched by 20%. This number is totally arbitrary and based on gut feel. I also assumed the eastern conference was de facto clinched based on the players who opted out or were injured for the Wizards.
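
Here is one way the handicap could look in code. It reads "reduce the ELO by 20%" literally, as multiplying a clinched team's rating by 0.8, which is an assumption about the mechanics; the ratings are made up. Read this way the discount is a very large swing, enough to turn a clear favorite into an underdog.

```python
CLINCH_HANDICAP = 0.20  # the arbitrary, gut-feel 20% from the text


def elo_win_probability(rating_a, rating_b):
    """Standard Elo expected score: probability that team A beats team B."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


def adjusted_rating(rating, clinched, handicap=CLINCH_HANDICAP):
    """Scale down the rating of a team that has already clinched (assumed multiplicative)."""
    return rating * (1 - handicap) if clinched else rating


# Hypothetical matchup: a clinched 1700-rated team against a 1500-rated bubble team.
before = elo_win_probability(1700, 1500)
after = elo_win_probability(adjusted_rating(1700, clinched=True), 1500)

print(f"Clinched team win probability: {before:.0%} before, {after:.0%} after the handicap")
```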

Given the relatively poor performance of the model, I was seeking to explain the following data points:

The Bucks & Lakers were playing very poorly.
The Suns & Blazers looked unstoppable.

With the modification of the model to reduce ELO for qualified teams by 20%, the new playoff odds looked like this:

playoff odds with ELO reduction for clinched teams

Of course, simply buffing Portland’s playoff odds massively increases the accuracy of the prediction, so this might be a bit too reductionist. Furthermore, with some clever configuration of Excel to leverage the solver, the exact handicap percentage could be tweaked to maximize the odds of Portland making the playoffs.² That being said, let’s take a look at how model quality changes with this change:

This is MUCH better. Obviously, the updated model has the benefit of some hindsight here. But with a small, targeted change, the model was able to increase accuracy from 54.7% to 69.2%. Precision & recall increased by similar margins. I think there is something here that can be applied to future models of NBA outcomes.
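
For readers who want to reproduce these metrics on their own model, here is a minimal sketch of accuracy, precision, recall, and F1 computed from a confusion matrix. The counts are invented for illustration and are not the numbers behind the figures above.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: predicted vs. actual outcomes across simulated scenarios.
metrics = classification_metrics(tp=40, fp=10, tn=30, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```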

Conclusion

Overall, I am satisfied with the outcomes of this process of exploring the model in the context of the metrics above. The key learning for me is that certainty of outcomes does impact the quality of play, at least in the NBA bubble. After accounting for that, we were able to increase model accuracy by more than 25% in relative terms (from 54.7% to 69.2%). To get more accurate, my analysis would need to be more surgical in approach.

My biggest take-away is that I will be designing future models to enable rapid analysis using the metrics herein. I didn’t do that in this case as I didn’t account for actually doing this analysis. Having appropriate consideration for accuracy testing in the front end would have meant I could have backtested assumptions and model changes across a much broader data set. As a result, I didn’t have an easy way to test my updated assumption of the 20% ELO discount down at the game level. I’m certain that applying better science techniques could result in an even higher accuracy model.

I do find it super interesting that there was a huge miss on the New Orleans Pelicans’ performance vis-a-vis their ELO rating. This entire process was arguably designed to maximize the odds of the Pelicans (& Zion) to make the playoffs, and in that regard, the NBA’s experiment failed completely. Conversely, one thing that could have been anticipated based on the 20% ELO handicap is that the Phoenix Suns had around a 35% chance to get 7 or 8 wins. Given that, it probably would have made more sense for the NBA to open a mini-tournament at the bottom of the bracket for 7/8/9/10. It would have increased the quality of play and led to a more exciting finish to the end of the regular season. And I think the NBA, who certainly has modelers far more sophisticated than I, should have anticipated the drop in play associated with teams who have already clinched.

footnotes

¹I’m using the assessment framework found here on towardsdatascience.com, for accuracy, precision, true positive rate, sensitivity, and F1 score. You can find the definitions within that link – it’s worth the read.

²After writing this, I did some Excel tweaking to allow the solver to optimize the handicap for clinched teams. It was 20.00001%. Bizarre.

Duel: NBA Bubble Projections

For our inaugural duel – Jacob created a data set based on 538’s NBA Predictions. He’ll create a deep dive into the mechanics of the model and how to leverage Excel’s data table function for no-code simulations in a later post. The data is available at the bottom of this post.

Jacob is a native Excel user and has created similar models for his Fantasy NBA league. He was able to take those models and dress them up for this duel – albeit in a format that was native to PowerBI and Excel. More on how that impacted the Tableau side of the analysis below.

Since we were using the 538 data set, we decided the first part of the challenge should be to replicate the view above in PowerBI & Tableau.

538 – NBA Predictions
PowerBI – NBA predictions
Tableau – NBA Predictions

Some of the data weren’t readily available, i.e. projected point differentials and team logos. For the purpose of the commentary below, we will be ignoring these facts.

As a phase 2 / stretch goal for this challenge, we also set out to create our own, novel visualization of the scenario combinations. This helped us to answer questions like “When the Bucks make the finals, who are their most likely opponents?” or “What are the paths for the Celtics to the conference finals?”.

Tableau Scenario Modeler
PowerBI Scenario Modeler

PowerBI Commentary

This section is written in first-person by Jacob.

Part 1: 538’s visualization

Where PowerBI succeeded: Getting the calculations out of the attached data set was fairly easy once I sorted out the data model in my head. While the data wasn’t perfectly formed, it was quite easy to shape it using PowerQuery to get what I needed.

I added a couple of measures on top of it and had the table working pretty quickly. Getting the conditional formatting to match was fairly easy too, although to get an exact color match I used the “color dropper” from PowerPoint on a screenshot of the website (gross).

Where PowerBI struggled: I couldn’t quite figure out how to get the sorting to work when I replaced certain values with “icons”, i.e. >99% or the “checkmark” icon. PowerBI treats the field as a string and therefore does a character-based sort. This means that applying a single sort on the outcome of the model doesn’t really work! Instead, you have to sort by ELO rating or by Projected standings to get a cohesive sort.
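
The underlying problem is easy to reproduce outside of PowerBI. The sketch below formats a few made-up probabilities the way the 538 table does and compares a character-based sort of the labels with a sort on the underlying numbers. The thresholds in `format_odds` are my guess at the display rules, and the general remedy shown, sorting on a numeric column kept next to the label, is not necessarily the hack from the video.

```python
def format_odds(p):
    """Display label for a probability, in the style of the 538 table (assumed rules)."""
    if p >= 1.0:
        return "✓"
    if p > 0.99:
        return ">99%"
    if p < 0.01:
        return "<1%"
    return f"{round(p * 100)}%"


# Hypothetical probabilities for five teams.
odds = [0.05, 0.995, 0.40, 1.0, 0.003]
labels = [format_odds(p) for p in odds]

# Character-based sort on the labels, as a string column would be sorted.
string_sorted = sorted(labels, reverse=True)

# Sort on the numeric value first, then format for display.
numeric_sorted = [format_odds(p) for p in sorted(odds, reverse=True)]

print("string sort: ", string_sorted)
print("numeric sort:", numeric_sorted)
```

The string sort puts "5%" ahead of "40%" and strands "<1%" in the middle, while the numeric sort keeps the labels in the order a reader expects.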

After I wrote this initially – I did find a workaround for this sorting issue, sort of. This video from Guy In A Cube explains the “hack” – but it is indeed just a hack.

Part 2: Scenario Modeling

I am pretty satisfied with how this visual turned out – but the sorting on probability fields continued to plague me. Also, the mental model for this data was effectively recursive, and I am not sure how to accomplish this in PowerBI, so I imported the same table twice. See the image on the left for how this was accomplished.

After fighting with DAX on and off for a few days, I was able to get a “base scenario” calculation using the ALL filter. This meant that when you selected a Team from “series_winners” you could calculate the odds of that scenario versus the “base” scenario. This surfaces really neat scenarios in the modeler, such as an OKC win in the second round which doubles Milwaukee’s championship odds.

You can find the DAX for stripping the filters from the “series winner” table, below.

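All Scenario Win Pct =
CALCULATE (
    -- count of series_detail rows per distinct scenario
    COUNT ( series_detail[TeamID] ) / DISTINCTCOUNT ( series_winners[ScenarioID] ),
    -- ALL removes any selection on series_winners, giving the "base" scenario
    ALL ( series_winners )
)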
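
For readers who don't speak DAX, the same "base versus selected scenario" idea can be sketched in Python. The scenario rows and field names here are invented for illustration and are much simpler than the real series tables.

```python
# Hypothetical simulated scenarios: who won a given second-round series,
# and who went on to win the championship in that scenario.
scenarios = [
    {"id": 1, "series_winner": "OKC", "champion": "MIL"},
    {"id": 2, "series_winner": "OKC", "champion": "MIL"},
    {"id": 3, "series_winner": "OKC", "champion": "MIL"},
    {"id": 4, "series_winner": "OKC", "champion": "LAL"},
    {"id": 5, "series_winner": "HOU", "champion": "MIL"},
    {"id": 6, "series_winner": "HOU", "champion": "LAL"},
    {"id": 7, "series_winner": "HOU", "champion": "LAL"},
    {"id": 8, "series_winner": "HOU", "champion": "HOU"},
]


def title_odds(scenarios, team, series_winner=None):
    """Championship odds for a team, optionally filtered to one series outcome.

    With series_winner=None every scenario is used, which plays the role of
    the unfiltered "base" calculation.
    """
    pool = [
        s for s in scenarios
        if series_winner is None or s["series_winner"] == series_winner
    ]
    if not pool:
        return 0.0
    return sum(s["champion"] == team for s in pool) / len(pool)


base = title_odds(scenarios, "MIL")
selected = title_odds(scenarios, "MIL", series_winner="OKC")
print(f"Base odds: {base:.0%}, odds given an OKC series win: {selected:.0%}")
```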

Tableau Commentary

This section is written in first-person by Nate.

Part 1: Data Prepping

Where Tableau succeeded: Tableau handled the data really well once I completed a lot of trial-and-error to get the data into the right format. The data model I put together included two of the tables twice, so it’s likely sub-optimal but is functional. Specifically, Tableau is consistently improving how data can be loaded & prepared (see the recent changes just launched in 2020.2), but my unfamiliarity with those new features meant I didn’t have time to give them a go on this analysis.

Where Tableau struggled: My experience getting data into Tableau nearly always involves a connection to SQL – either a direct connection to a table or a very clean CSV output of a SQL query. Since the data model created by Jacob is in PowerBI/Excel, I had to do some manual adjustments to the tables to get them in the format I needed, such as creating long tables (just a few columns) out of wide tables. This resulted in several more hours of work as I did trial & error between modifying data and trying to visualize it in Tableau.
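
The wide-to-long reshaping described here can be sketched in a few lines of plain Python. The column names and numbers are hypothetical; in practice this step would more likely live in SQL or a data prep tool.

```python
def wide_to_long(rows, id_column, var_name="stage", value_name="probability"):
    """Turn one row per team with many columns into one row per team and column."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            long_rows.append(
                {id_column: row[id_column], var_name: column, value_name: value}
            )
    return long_rows


# Hypothetical wide table: one row per team, one column per playoff stage.
wide = [
    {"team": "AAA", "make_playoffs": 0.90, "make_finals": 0.40},
    {"team": "BBB", "make_playoffs": 0.35, "make_finals": 0.05},
]

long_table = wide_to_long(wide, id_column="team")
for row in long_table:
    print(row)
```

Each team now contributes one row per stage, which is the "just a few columns" shape that Tableau prefers.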

Part 2: 538’s visualization

Where Tableau succeeded: Getting this table created was very simple once I finalized the data model. Sorting works well across all columns and the Tableau method of dragging dimensions & measures around to get colors & formats worked well.

Where Tableau struggled: I could not get some of the nifty 538 features in the table – such as a checkmark at 100%, and a string for “<1%”. Instead, Tableau displays 0% for situations that round down to 0%. I tried adding in a decimal place, but that just cluttered up the view. As well, Tableau does not have strong conditional formatting capabilities for the background of cells. It’s possible (see the KB article here) but I saved development time for other work by instead just coloring the numbers and shifting to a darker cell background.

Part 3: Scenario Modeling

Where Tableau succeeded: After several rounds of arm wrestling, pleading and bribing – Tableau finally assented to the view I wanted which included:

(a) The original odds
(b) New odds based on selection
(c) Visualization demonstrating change in odds

There’s much more I wish I could have done but in the interest of time (the playoffs are going now!) it was time to hit ‘Publish’

Where Tableau struggled: I spent multiple hours trying to get the FIXED LOD formula in Tableau to work which would allow me to compare the odds from both the unfiltered view and the filtered scenario view at the same time. Tableau can be frustrating to use when trying to visualize multiple levels of detail in the same view, and likely my chosen data model contributed to the struggles.

I asked the best Tableau user I know for some advice as I was getting this viz prepared and his advice sums up the struggle well: “When dealing with LODs, I usually just try every permutation until something somehow works.” Turns out he was right in this instance, too.

Overall Winner: PowerBI

Category Winners:

538 Duplication: PowerBI – but really neither piece of software allows the customization that the web package used by 538 has. Still, we gave it a decent go. PowerBI does tables a bit better, so it wins here.

Scenario Modeling: While Tableau is very snappy and honestly more “discoverable” (good job with the tooltips, Nate), the PowerBI “tournament visual” is very intuitive for sports fans. Additionally, the analysis at the bottom of the chart is more comprehensive and more understandable than the Tableau bar charts of the same info. We give the edge to PowerBI.

A note on data prep (not scored): The in-app data prep with PowerQuery is a no-contest when compared to Tableau. This will pretty much always be true and can be both good and bad. Good, because it allows a lot of control at the fingertips of the analyst, and bad, because a lot of code, either in M or DAX, gets added to PowerBI instead of the database engine of your choice.

PowerBI Commentary from Tableau User

Where PowerBI succeeded: PowerBI is more equipped than Tableau to display data in a tabular format such as the one on 538 and that shows in the final product. The ability to put many small views of data into a single dashboard also proved to be powerful in the final scenario modeling output.

Where PowerBI struggled: PowerBI depended too strongly on its native table functionality, resulting in lots of details but a lack of bold & clear visuals. Sorting also turned out to be tough as you may notice when using the final interactive version.

Tableau Commentary from a PowerBI User

Where Tableau succeeded: The data model is much easier to grasp even to technical users. It’s much faster to interact with, and the tool-tips make it very easy to understand. Additionally, clever use of the NBA logo immediately contextualizes the user.

Where Tableau struggled: As Nate mentioned, the data prep took a significant amount of time. The conditional formatting inside tables is not very finely tuned, especially compared to PowerBI. Hilariously, sorting inside a table has its own set of issues (sort on “Win Championship”).

Links

Link: Interactive version of the PowerBI report.

Link: Interactive version of the Tableau report.

Let us know your thoughts below! A list of the files can be found after the jump.


Charts Reconsidered: Mask Wearing

Like any good analyst, I enjoy scrolling through r/dataisbeautiful. And when I say enjoy, I really mean “I’m doom-scrolling through reddit because twitter is depressingly worse.” Of course this leads me here, to our first entrant into “Charts Reconsidered”, where every week I will revisit a chart from reddit and suggest some improvements.

That leads me to this chart – the 5th ranked chart on the subreddit on July 21st, 2020. It tells us who does and doesn’t use masks, by a few different breakdowns. It is an interesting story, but it could be told in a better way.

3D bars. Yikes. This reminds me vaguely of “WordArt”.

Sorting

There are 4 distinct groups in this chart – Gender, Political Party, Education, & Overall. They are all mashed together with no space.

In Excel, I would use “blank” series to add space between each group to improve readability while enabling shared axes. I would also pick a consistent series to sort on from high to low.

Colors

Green & Grey isn’t a great color combo and gets amplified by a lime green gridline color. It’s not a good look.

Keep the gridlines in the background, a lighter gray perhaps. For a chart like this, I would use a light and dark tone of the same color. Or you can steal the Ben Evans approach – and use shades of gray + a single color for emphasis (in his case, red).

Labels & Gridlines

Too much info crammed into this part which muddles the story. There are major and minor Y gridlines, which are then labeled without a percent sign. The bars are also labeled. Lastly, the X & Y axis labels are switched.

Turn off minor gridlines and make the major gridlines either 25 or 50. Add Percentage labels so the units are clear. And fix the axis labels (or remove them).

Chart Type

3D bars with series stacked front to back is not a good look. This is most obvious in the GOP group, where the labels overlap the bars. The lack of spacing between groups makes it challenging to see differences between groups as well.

Just use a regular, stacked bar chart.

New visualization

With the magic of PowerBI – I’ve crafted a new chart, with the same data, to tell a more visually appealing & easier-to-understand story.

Regular stacked bar chart, grouped and then sorted alphabetically.

The labels are removed, the legend is cleaned up, and the colors are simplified. Did it take longer to make this chart? Yes! Does it tell a better story – also yes!

I hope you found this feedback helpful. Let me know what else you would change in the comments below.