bi Archives – DataDuel.co

If spreadsheets are eternal, are BI tools transitory?

A few months ago, Benn Stancil wrote about the eternal spreadsheet. While I appreciated the generous shout out to both mdsinabox and motherduck (my employer), this really got the wheels turning around something that I have been feeling but have only been able to put into words recently: are BI tools transitory?

Consider the following scenario in the Microsoft stack: ingest data with ADF, transform with Fabric (& maybe dbt?), build a semantic model in Power BI, and delicately craft an artisanal dashboard (with your mouse). Then your stakeholder takes a look at your dashboard, navigates to the top left corner and clicks “Analyze In Excel”. How did we get here?

I remember back in the 90s, hearing my dad talk about “killer apps”. The killer app was the app that made your whole platform work. If you wanted your platform to be adopted, it needed a killer app so good that users would switch their hardware and software (an expensive proposition at the time) so they could use the killer app. In my lifetime, I recall a few killer apps: The App Store (iOS), the web (Internet), and Spreadsheets (personal computing).

Spreadsheets allowed a user to make an update in one place and for the data to flow to another place. If this concept seems oddly like a directed-acyclic graph (DAG), that’s because it is. These spreadsheets contain a bunch of features that we find handy in the data solutions stack today: snapshots (save as), version control (file naming), sharing (network shares & email attachments), business intelligence (pivot tables & charts), file interoperability (reading csv, json, xml etc), transformation (power query (there was an earlier, even more cursed version too)). All of these pieces have obvious metaphors in the commonly used data stacks today. Critically, one piece is missing: orchestration (note: back in mid 2010s, I used and loved an excel plugin called “Jet Reports” that included an orchestrator, among other things).
Now if you were running a business in the 90s (like these guys), there was no need for orchestration in your spreadsheet. You, the business user, were the orchestrator. Your data came from many places – memos (later, emails), research (books, later pdfs), a filing cabinet (later, databases), phone calls (later, slack), meetings (later, zoom calls), and your own synthesis (later, chatGPT (just kidding)). Software could not contain these! We did not have the digital twins for these analog processes. In some ways, the spreadsheet was the perfect digital collection point for these physical artifacts.

As each of these parts of our business decision making input processes transitioned to digital, our poor spreadsheet began to fall out of favor. We replaced memos with emails, phone calls with IM (via skype, if you are old enough to remember), and so on. And these digital processes began to produce loads of data. Every step produced an event that was stored in a database. The pace of change in the business environment increased in-kind. Our once per month spreadsheets orchestrated by humans were a bit too slow, processes produced too much data to be aggregated by humans. I fondly recall the launch of excel 2007, which included a new architecture and file format, so that we could process one million rows instead of only 65,536.

Unfortunately, the hardware at the time could not actually handle one million rows. Every single person using excel, unencumbered by 32bit row limits, ran into the hard limits of the Excel architecture and inevitably seeing a spinning, “waiting for excel” icon before crashing (hopefully you saved recently). Hilariously, Microsoft trained users to tolerate an absolutely terrible experience. Excel could do too much. What we needed to do was unbundle this tool, take it apart piece-by-piece, so that we could have good, delightful experiences for our business users. Users could still use spreadsheets for things, but we needed to shift the load bearing intelligence of our business decision making into better tools.

So we built even more powerful databases, and ways to automate decision making at scale. We began to use multiple computers, running in parallel, to solve these problems for us. Large complex systems like Hadoop were required to aggregate all this data. Companies like Google harnessed the immense scale enabled by these systems to become the largest in the world, building never-before-seen products and experiences.

At the same time, CPU clock speeds stopped increasing. We had maxed the number of cycles we could push out of the silicon in the fabs. But innovation found a way to continue – we began to add more cores. Slowly but surely Moore’s law kept on holding, not on clock speed but on throughput.

The software built to take advantage of the scale that was possible with huge quantities of networked computers made assumptions about how to work at great scale across many machines (i.e. Spark). These assumptions did not generalize to single machines with many cores. This has not been unnoticed, of course (see George’s tweet).

So what happened to our business intelligence while this was going on? The number of tools exploded, while the consumption interface remained unchanged. Categories were split into sub-categories into sub-categories. We only had so many charting primitives, and whether we dragged and dropped with Tableau or used BI as code in Evidence, the output looked largely the same. But instead of one tool that we needed in the 90s, we now had thousands.

But I would argue we haven’t added anything new, we’ve merely unbundled it into a bunch of different products and that don’t work that great together. REST APIs have allowed scalable, loosely coupled systems but really suck to work with. Behind every large enterprise data workflow is an SFTP server with CSVs sitting on it (if you are lucky, its object storage and a compressed format, but its the same thing).

If we look at the trends, in 5 years we will have approx. 10x more compute than we do today, and Backblaze estimates that cost per GB of storage will stabilize around 0.01 / GB ($10/TB). If these trends hold, we will easily have enough horsepower on our laptops to put all these pieces that we have decoupled over time, into one box. If BI tools are transitory, spreadsheets are eternal. The era of spreadsheets 2.0 will be upon us.

What are the characteristics of Spreadsheets 2.0?

Runs on a single node with many cores (hundreds?)
One file format for handling all types of data (xlsx++)
One language for end-to-end data manipulation (sql)
A spreadsheet UI for interacting with data at any step in the data manipulation (power query-ish)
Fast, interactive charting (mosaic)
Intelligent, incemental orchestration (dynamic dags)
An AI trained on all these parts above to assist the user in documentation, understanding, and building (clippy but good)

I believe the first building block of this has emerged in front of our eyes: DuckDB. The hardware is being built as we speak (the feedback loop will build it whether we like it or not). Julian Hyde is advocating for “completing the SQL spec to handle metrics” (with apologies to Malloy) – humans have refined this language over the last 50 years and will continue to do it for the next 50. We already have the UI primitives (Excel), so we merely need to bolt these together.

It’s time for the humble spreadsheet to RETVRN. It’s time to bring humans back into the workflow, empowered by AI, to own the data ingestion, transformation, and synthesis required to make decisions. Of course, I’m tinkering with this idea today, if you are interested in what I have so far, please reach out, I would love to talk.

Defining Analytics Engineering

No one knows what it means, but it’s provocative. It gets the people going!
Chazz Michael Michaels

Context: This is the first of a series of posts looking at Analytics Engineering as I’ve lived and breathed it since 2014, largely in the SaaS industry. This is not meant to provide universal truths, but rather give insight into one way to make sense of the data roles around us and how analytics engineering fits into them.

Definition and why the role matters

Analytics Engineering is the process of transforming raw, unaltered data into well-structured, governed datasets that enable meaningful analysis and measurement of operations.

This is a fairly recent title, popularized by the data transformation tool dbt in this writeup by Claire Carroll. It didn’t really take off until the advent of cloud data warehouses

At first, this role doesn’t seem necessary. Can’t people just … export data into Excel and do analysis? Haven’t data analysts been working for decades without “Analytics Engineers” existing?

Two replies:

Yeah, you can. Which works … until it doesn’t. But eventually scale becomes a problem as people waste time as Excel jockeying instead of whatever else they should be doing
Data analysts and others were already doing this work, it just didn’t have a name yet or was tool-specific (e.g., data munging in SQL Server Integration Studio)

I found myself doing analytics engineering in my very first data job 10 years ago. My title was “data analyst” and I described myself that way, but my work was curating datasets for executives and other business teams to leverage for decision making. The prior Excel-export model was running into scaling issues and I was able to save everyone time by automating data availability.

Analytics Engineering is both old and new. People have been doing the work for a long time, even if the title is relatively new.

How Analytics Engineering fits within the data ecosystem

A simplified view of the three main pillars of the data world

Analytics Engineering fits roughly between two types of data roles many are familiar with: Data Engineering and Data Analytics/Science. The far-too-simple dividing line between the roles is:

Data Engineering: Brings raw data from disparate systems into a single data warehouse

Analytics Engineering: Transforms the raw data within the warehouse into something useful by cleaning, adding business logic, etc,

Data Analytics/Science: Leverages datasets created by analytics engineers to produce analyses, dashboards, models and more for stakeholders across the organization

Reality: Things aren’t that clear cut

Life is never as nice as a graphic of an over-simplified view of the world. Analytics Engineering is no different, it’s a messy role in the middle of messy data. Here’s a closer representation to how these three categories fit together:

I could make this a lot messier. The lines between data teams are incredibly blurry.

Data engineers typically do some analytics engineering. Data analysts typically do analytics engineering. Analytics engineers do some of both.

Which makes sense! With the roles ill defined at most companies (including Analytics Engineering missing entirely!), you’ll get a mishmash of people doing a variety of these tasks. Often times, people are indirectly incentivized to blur the lines in their role to get a task or two done.

What’s next?

We’ll dive deeper into the core value prop of an analytics engineer: transforming raw, unaltered data into well-structured, governed datasets that enable meaningful analysis and measurement of operations.

What Good Data Self-Serve Looks Like

I once was tasked with figuring out how to ‘democratize data’ for internal employees. No other instructions, solely a general pain point of ‘the data team is stuck doing ad-hoc tickets’ and ‘stakeholders want to get data on their own.’ After floundering for a while, I set out to figure out what data self-serve looked like at other companies. Seemed simple enough. But I quickly learned things aren’t that simple, and when are they for cross-functional data projects, anyway?

I want to share what I learned during an earnest effort to stand up data self-serve. I know others are struggling with this same project and its ambiguities and humongous scope. I spent time reading, thinking, attempting, failing, trying again, failing again, trying again and seeing success. Let’s dive in.

Executive Summary

Data rarely moves fast enough across companies to enable data-informed decisions. The data team is a bottleneck behind which many requests stack up in a queue. The data team drowns in questions and stakeholders become frustrated. Long wait times ensue, forcing the business one of three decisions:

(1) Wait to make a decision
(2) Make a decision without data
(3) Departments hire their own data workers.

The data velocity problem is not reasonably solved through sheer volume of hiring. Data workers are expensive and hard to find. Instead, data teams typically pivot to enabling the business via self-serve and data democratization. Ideally, this unlocks the data team to focus on strategic analyses and initiatives and the business is freed to find the data they need without submitting a ticket.

Effective data teams must pivot away from (or avoid entirely!) taking tickets and into partnership with the organization, focusing on building scalable data solutions from which others can self-serve.

Data Self-Serve Definition

Data self-serve is notoriously hard to define. Any definition is inevitably specific to a company and point in time. The definition below is therefore intentionally generic:

Ability for any employee to quickly find and leverage the data and insights they need for their role without funneling through the Data Team

Good Self-Service Always Looks Different

(and bad self-service always looks the same)

Data self-serve done well, by its very nature, looks very different from company to company. The tools, processes, and organization structure supporting self-serve requires tailoring to the organization, rather than following a blueprint.

In contrast, data self-serve nearly always looks the same when done poorly. The data team is overwhelmed with tickets, business users can’t find data and strategic analyses do not happen.

Put another way, bad data organizations tend to look the same, but working data organizations look very different from each other.

This puts data teams in a liberating but uncomfortable position. They’re able to freely define and create the self-service experience that their organization needs at this very moment, but at the same time there’s no blueprint for success. This requires an interactive approach to find the best solution for their company.

The focus must be on the stakeholders’ needs, not the data team’s needs. Data teams have a habit of making data-self serve in their image instead of thinking about who they’re serving. The focus must be and will be squarely on the data self-serve experience of coworkers.

Typical Hurdles to Self-Serve

It’s easy to conjure up a world where someone types or speaks a question into a machine and gets the data they need. This is the ultimate data self-serve utopia and one on full display in the excellent sci-fi show The Expanse. The main characters routinely verbally ask computers questions like “Pull up every ship within X distance which was made by Y company and left Z destination in the last week.”

There are many hurdles before that sort of world is possible. Let’s cover a few of them:

Data operated as a service

Many data teams operate with a “receive a ticket, answer a ticket” mindset. This limits the team’s output into reactive short-term and smaller-scale asks and puts them as a bottleneck between the business and data. This is not an uncommon problem within the data industry. Data service teams are typically overstretched and struggle to answer all the questions coming their way. The business inevitably assumes the data team isn’t a strategic partner as they don’t seem to operate like one. If you operate like a service desk, you’ll be treated like one.

This team structure rarely scales. I recall a discussion with a C-Suite member who criticized the data team as “Getting me answers so late that I’ve forgotten my question by the time they reply.” Ouch.

In contrast, effective BI teams operate with a product-like mindset that focuses on scale and solutions. They partner closely with stakeholders to solve problems and prioritize ruthlessly based on business impact.

Data foundations not yet ready for self-serve

There is an immense amount of work required to get data ready for self-serve. Just cleaning up a few raw data tables isn’t enough. Each part of the business (Sales, Marketing, Product, etc.) need different sets of data to answer their unique use cases.

Prepping data into the right shape requires close partnership and collaboration between the data team and its internal stakeholders. This requires steps like data ingestion and transformation, implementing tooling like Git and dbt and having a team that can support the data lifecycle of a company.

Lack of data literacy

Data literacy, much like self-serve, is a tough term to nail down. This deserves its own discussion entirely, but for now let’s go with a typically squishy definition along the lines of “How well stakeholders can interact with and understand data.”

Training for data literacy is immensely difficult. Even if you have the world’s best data models and data marts and the Modern Data Stack™️ stakeholders will struggle to find value if they can’t grok the internal business data model or fall into common data pitfalls. This hurdle must be overcome no matter how well you do everything else.

Lack of data tools that enable self-serve

Typically there are two primary ways that an individual can self-serve data: SQL on a database or look at a data-team-created dashboard. SQL is great for technical individuals but is not an option for the majority of employees. Dashboards are usually widely available but lack customization. Generally dashboards are a “you get what you get” type of experience, with little to no drill down capability and slow turnaround times from data teams for enhancement requests.

Data teams must provide other options for non-SQL savvy users to explore data in a more ad-hoc sense, leveraging curated/enriched tables created for their department. This can look as simple as providing access to enriched data in Excel or “Reverse ETL” where you send data back to source systems like Salesforce for direct consumption in those contexts.

Data privacy

This varies company to company, but data privacy comes into play depending on industry and company size. And whenever privacy is a factor, data access becomes more difficult. Typical lines in the sand are material non-public information (MNPI) when a company is publicly traded or personally identifiable information (PII) that only specific people should have access to.

Data self-serve almost always runs into data privacy concerns and the height of this hurdle (or wall…) will depend on the company.

But…What Does Good Self-Serve Look Like?!

Even though I described at length that good self-serve usually looks different, there are still some guiding principles to shoot for. These may not be universal and may change depending on your company’s data maturity, but they should be helpful.

(1) Focus initial efforts on specific departments/teams

One common bugaboo is an attempt to boil the ocean. The data team is already spread thin and pivoting everyone to self-serve for all departments at the same time will be too much. Instead, focus on a couple teams or departments with clear self-serve needs. Assign a specific analyst or two for the project who already understands a particular business domain and want to take on the challenge.

This both narrows the scope and increases the likelihood of close partnership with those teams. With partnership will come alignment on business value and understanding of pain points. Everyone wins.

(2) Create roadmap in partnership with stakeholders

Self-serve must necessarily look different from department to department. The needs for Finance are wholly different from Product, Engineering, Field, Legal or Marketing. This is why a focus on self-serve and a dedicated BI partner is so crucial. Requirement gathering and roadmap creation must be done in close collaboration between BI and each department. Examples of requirements to gather include:

Examples of requirements to gather:

* Use cases
* Defining personas (technical/non-technical/etc)
* Tools needed
* Datasets
* Training/Enablement

(3) Build source of truth data marts

A data mart is a set of tables designed for ease of use by a department for their self-serve needs. These tables are specifically curated by the data team to make data easy to consume and understand for a particular department.

Just providing individuals with access to the entire database is inevitably overwhelming. There could be hundreds of billions of data points across thousands of columns and hundreds to thousands of tables. Many analysts need a year to become comfortable with data at its most granular state. Expecting non-analysts to just hop in and find value isn’t reasonable.

To avoid this steep learning curve, a curated data mart enables self-service without overwhelming stakeholders. This curated data mart must be built in close collaboration between the data partner and their stakeholders to find the sweet spot of “plenty of data” and “not confusing”.

Example: The sales team needs a few good tables such as Account, Opportunity and Task from which they can build most any report they need.

(4) Create an adoption and discoverability program

Data discoverability is an enormous challenge that must be tackled on several angles. The existence of data marts alone is not enough to drive adoption if individuals do not know how they exist or do not know how to leverage them.

To drive adoption, efforts must include:

* Training / Onboarding sessions for all stakeholders
* Clear documentation for all data marts, tools available, key reports
* Weekly office hours
* Monthly & quarterly prioritization meetings
* Deprecation process to clean out old/unused data products

Parting Notes

There’s much more to write and I’ll follow up around defining internal stakeholder personas and choosing technologies that solve different aspects of data self-serve. For now, I hope the key message you took away from this is:

“I’m empowered to figure out how to best do self-serve at my organization.”

There’s an art to this task, and that’s why it’s so difficult to find anyone giving a blueprint. There really isn’t one. And you’ll never “arrive” at the conclusion of this project. You’ll just continually improve it, much like you do all your other data efforts. The fun is in the journey.

Start Simple With Your Analytics Project

Start Simple & Iterate

Up to this point, I’ve largely written for those looking to break into an analytics career. Today I’ll go beyond that and discuss the most powerful lesson I and many others learned — something I wish I fully understood starting out:

Start your analytics project as simple as possible and iterate from there.

This strategy borrows a lot from Agile software development not because I’m a student of it, but because I learned the values of Agile through trial and error. Only after I stumbled upon this strategy did I learn how closely it aligns to the Agile methodology.

The Common Mistake

I’m going to assume you’ve already solved the toughest issue in analytics: identifying an ambiguous problem. Congrats! Now you need to figure out how to make it happen. This is where things can go wrong.

Many analysts (myself included!) are then tempted to:

Retreat to your office
Gather & clean all the data you think everyone needs
Build the World’s Best V1 Dashboard
Schedule a meeting to present the dashboard
Receive unanimous praise for how amazing it is
Watch as everyone uses your dashboard daily

What really happens:

Retreat to your office
Gather & clean only some of the data people need
Spend way too long building the Dashboard No One Really Wanted
Stakeholders email you intermittently asking if you’re making progress
Schedule a meeting to present the dashboard
Entire meeting spent fielding questions like “Why don’t I see X or Y?”
Get the cold sweats realizing you don’t have what they need
Stakeholders frustrated that so much dev time was wasted
You’re frustrated that they are “changing what they need”
Retreat to your office

Why Does This Happen?

Every data analyst/scientist makes this mistake. It will continually happen throughout your career, even after you think you’ll never make that mistake again. No one is immune.

There is one core reason why this happens: You assume you understand what the stakeholder wants.

Except you likely don’t. Especially when you’re early in your career. You’ll think you’re on the same page with your stakeholder, but you aren’t. You think you know what data points the stakeholder needs, but you don’t (hint: the stakeholder likely doesn’t know either!). You think you know what kind of visuals the stakeholder will find most useful, but you don’t.

In fact, it’s so difficult to get everything right the first time, you should assume you don’t fully understand the request. That one time you actually do build “The World’s Best V1 Dashboard”, celebrate the unexpected success – it won’t happen often.

Strategy: Start Simple

There’s a solution to this problem: Start your analytics projects as simple as possible. This results in less wasted time in development and happier stakeholders at the end. The process looks like this:

Agree with stakeholder on an MVP (Minimum Viable Product) – something small that can be done quickly
- Your stakeholder may not know exactly what they want, so you may have lots of freedom here
Gather & clean only the data you need for the MVP
Create MVP dashboard
- Ask your stakeholder questions here, too! You don’t need to go radio silent and many times they’ll appreciate the feedback loop
Present MVP dashboard to stakeholder
Gather feedback from stakeholder
Start process over again

This process is designed to be quick, with small iterations should building on each other until everyone agrees the dashboard fits the needs of the business. The more interactions with stakeholders the better – you’ll quickly identify misalignments, missing data, new requirements, changing business needs and more.

The advantages should be clear. Stakeholders will feel ownership over a product they helped develop (leading to better adoption!). The end product will be closer to what the business needs (leading to better adoption!). And stakeholders will remember the success of the project and give you a call for the next one.

Conclusion

Don’t try to build Rome in a day on any analytics project. You’ll rarely succeed. Instead, iterate and build on a project until it becomes something useful – and likely looks nothing like what you thought it would starting out.

Analytics is a dynamic field. Don’t fight upstream with how quickly things change; set up your work process to allow for quick changes. Your company & future self will thank you.

New Weekly Series: Everything Analytics

Do you enjoy working with data in your current role? Are you interested in a Data Analytics career? Are you currently a Data Analyst?

Good news! This weekly series is for you. It’ll cover all sorts of topics within analytics, including advice for aspiring analysts, best practices, key skills/tools and industry updates.

Initial blog topics include:

The Many Wandering Paths to Analytics
Analytics Job/Role Types
Key Skill Sets for Analysts
Visualization Best Practices
Measuring Success of Analysts
How to Prioritize Your Work Backlog
…and more!

Much of this will be written from my perspective as an Analyst. There are other perspectives out there for unique positions like Data Scientists and Data Engineering, and while I’ll touch on those regularly (and will write an entire post on the difference between those roles), the focus here will be Data Analysts.

See you in a week!