Part One: The Importance of Knowing What You Know.
Neighbor A picks up his cup of coffee on Saturday morning, walks out the back door, turns on the spigot, and waters a bed of azaleas a crew of landscapers planted the day before.
Rising with the sun, Neighbor B slips on her Muckster clogs, grabs her shovel, shears, and compost from the shed, and goes to her knees in the flower bed she tilled yesterday, preparing holes and soil so that the shrubs she plants there will thrive.
Which person is the true gardener? This isn’t a trick question.
Switching from the organic to the technical, let’s take a look at data scientists. Data scientist A reviews data, builds a model with a few lines of code, generates some results, and distributes a nice report. Data scientist B, on the other hand, digs into the data, monitors sources, assesses change over time, evaluates how the data behaves, challenges assumptions, and then . . . only then . . . builds a model.
Which person is the true data scientist? More important, which person would you trust to deliver the confidence you demand when making decisions that affect the efficiency and profitability of your business?
The one who practices exploratory data analysis.
What is Exploratory Data Analysis?
Exploratory data analysis (EDA) is an investigative process that gives you a more nuanced view of your data and what you can learn from it. It’s often the first step in data analysis and modeling, and it’s something you do to get a feel for data sets: to see patterns, spot anomalies, and form hypotheses. With EDA, you can generate questions and check assumptions, which you can later test using a variety of statistical methods. It’s more about exploring than confirming.
EDA also allows a deep dive into data variables to determine not only how they relate to one another but also how they connect with the outcome being modeled. For especially effective modelers, this deep dive includes understanding data in the context of the business uses it supports and how it applies to real-life situations.
EDA is often described as detective work in that you’re searching for clues and insights that can help you identify potential root causes of the problem you’re trying to solve. You can “explore one variable at a time, then two variables at a time, and then many variables at a time.”
Why Use Exploratory Data Analysis for Data Modeling?
While new tools have democratized predictive modeling and made it possible to build models very quickly – most agree that model building is the sexy part of data analysis – the adage still applies: what goes in is what comes out. Models will always generate results. But knowing and understanding each piece of data you have is the key to producing meaningful results that have practical application. Good quality data is “correct,” but that doesn’t mean the detective work is done.
What can happen if modelers don’t use EDA? Here are a few possibilities:
Model Noncompliance. If modelers don’t understand the data and its underlying connection to the population, the model can violate compliance regulations. This might happen if, for instance, you inadvertently exclude parts of the portfolio and the data has changed over time. It could even result in disparate impact if protected classes of people are treated differently.
Unintuitive Results. While using all the variables and correlations between the variables and the target can produce what seems to be a strong model, the validity of the results might be questionable if the model doesn’t generate a predictive outcome consistent with the business.
Here’s an example. A few years ago, a financial services company ran a model and discovered, they thought, that rising unemployment made credit delinquencies go down. What they overlooked was the fact that the credit profile of customers had changed considerably a few years prior, and the economy had improved. Those two variables were linked, and they created noise. Of course, it makes no sense from a business standpoint to suggest that unemployment is good for credit. Through effective EDA, a seasoned data scientist would know how and when to remove variables that will not add to the model’s ability to predict the outcome.
Leakers. Predictive models give a false sense of accuracy when they include variables in the modeling sample that come from the future rather than from the period during which performance is being measured. When this occurs, a model can appear to be highly predictive during the model build phase but not as predictive after implementation. These future variables are called “leakers” because they leak future information into the model building, and they are not true predictors of future performance.
An example of how a leaker undermines a model can be seen in a model developed to predict the probability that prospects will respond to a marketing campaign. The model uses a database of consumers who were marketed to in the past and a database of consumers who opened a credit card. However, the database of consumers marketed to is updated daily. So when using the variable “uses credit cards,” the information includes the credit cards that were opened in the present, the very target the model is trying to predict. Current data leaks into the model and compromises its predictive performance.
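One way to catch a leaker like this during EDA is to compare when each variable was captured with the date the prediction would have been made. The sketch below is a minimal illustration using only the Python standard library. The record layout, the field names (campaign_date, uses_credit_cards_as_of), and the sample values are hypothetical, and it assumes your data carries a capture date for each variable, which is not always the case.

```python
from datetime import date


def find_leaky_records(records, feature_date_key, observation_date_key):
    """Return records whose feature value was captured after the observation date.

    A feature captured after the point of prediction may contain information
    from the future, so it is a candidate leaker.
    """
    return [
        record
        for record in records
        if record[feature_date_key] > record[observation_date_key]
    ]


# Hypothetical sample: each prospect was marketed to on campaign_date, and the
# "uses credit cards" flag was last refreshed on uses_credit_cards_as_of.
prospects = [
    {"prospect_id": 1, "campaign_date": date(2023, 1, 15),
     "uses_credit_cards_as_of": date(2022, 12, 31)},
    {"prospect_id": 2, "campaign_date": date(2023, 1, 15),
     "uses_credit_cards_as_of": date(2023, 3, 1)},
    {"prospect_id": 3, "campaign_date": date(2023, 1, 15),
     "uses_credit_cards_as_of": date(2023, 1, 10)},
]

leaky = find_leaky_records(prospects, "uses_credit_cards_as_of", "campaign_date")
leaky_ids = [record["prospect_id"] for record in leaky]
print(leaky_ids)  # prospect 2 was refreshed after the campaign date
```

Records flagged this way are not automatically wrong, but they deserve a closer look before the variable goes into a modeling sample.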
How is EDA Used?
EDA is used primarily “to see what data can reveal beyond the formal modeling or hypothesis testing task and provides a better understanding of data set variables and the relationships between them. It can also help determine if the statistical techniques you are considering for [model building] are appropriate.” While understanding your data is critically important regardless of how you plan to use the information, it’s especially important for modeling. After all, how well you understand your data determines whether you’re building your model on a solid foundation of knowledge or unchecked assumptions.
EDA includes a variety of steps, and modelers apply the concept in different ways, some more thoroughly than others. Here are a few of the ways to get a qualitative understanding of available data as a check before making any assumptions.
Data Research. Modeling requires historical data that is old enough to allow sufficient time for performance to be measured. Any EDA process should involve researching data points over time to make sure the information being used for modeling reflects characteristics of the current portfolio.
Univariate Analysis. As you conduct your EDA, it’s important to understand each individual variable by itself, which includes evaluating distribution, fill rates, outliers, and seasonality, as well as coverage by dimensions such as time, product line, and geography. It’s also important to understand how the attribute is sourced and defined.
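Two of these checks, fill rates and outliers, are easy to sketch. The example below is illustrative only: it uses the Python standard library, treats None as a missing value, and flags outliers with the common 1.5 × IQR rule, which is one convention among several and an assumption here rather than a method prescribed above. The balances list is made-up sample data.

```python
import statistics


def fill_rate(values):
    """Share of entries that are populated (not None)."""
    if not values:
        return 0.0
    return sum(value is not None for value in values) / len(values)


def iqr_outliers(values, k=1.5):
    """Return populated values lying more than k * IQR outside the quartiles.

    Needs at least two populated values, because statistics.quantiles
    cannot compute quartiles from fewer.
    """
    present = sorted(value for value in values if value is not None)
    q1, _, q3 = statistics.quantiles(present, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [value for value in present if value < low or value > high]


# Hypothetical account balances with two missing entries and one extreme value.
balances = [1200, 1350, None, 1280, 1310, 25000, None, 1190, 1400, 1260]

print(fill_rate(balances))     # 8 of 10 entries are populated
print(iqr_outliers(balances))  # the 25000 balance stands apart from the rest
```

In practice you would run the same checks by time period, product line, and geography, since a variable that looks well populated overall can still be empty for a whole segment.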
Bivariate Analysis. What’s the predictive power of each variable in relation to the model target? Statistical methods such as correlation analysis, information value, and the KS test can help you answer that question.
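As one illustration, the two-sample KS statistic measures the largest gap between the cumulative distributions of a variable for two groups, such as accounts that went delinquent and accounts that did not. The sketch below computes only the statistic, not a p-value, using the Python standard library. The score values and group names are hypothetical, and both samples are assumed to be non-empty.

```python
import bisect


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # Fraction of the sample that is less than or equal to x.
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    # The gap can only change at observed values, so checking those is enough.
    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)


# Hypothetical credit scores for delinquent and current accounts.
delinquent_scores = [520, 540, 560, 580, 600]
current_scores = [590, 620, 650, 680, 700]

ks = ks_statistic(delinquent_scores, current_scores)
print(round(ks, 2))
```

A value near 0 means the variable is distributed almost identically in both groups and is unlikely to separate them; a value near 1 means the two distributions barely overlap.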
Multivariate Analysis. What is the relationship between the variables? Are any of them redundant? Is there an interaction between the variables that can better explain the target? A correlation matrix and the variance inflation factor (VIF) can be used to understand the relationships. Machine learning methods such as decision trees and random forests can be used to understand interactions between variables, too.
Business Perspective. Data scientists and subject matter experts on the problem the model is trying to solve should review the EDA to validate whether the data behaves as expected. By doing so, they can help ensure that any modeling or analytical effort will produce an explainable and adoptable model, and make the results intuitive for business decision makers to act upon.
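A simple starting point for the redundancy question is to compute the pairwise Pearson correlation between candidate variables and flag the pairs that move almost in lockstep. The sketch below does only that, using the Python standard library. The 0.9 threshold, the variable names, and the sample values are assumptions chosen for illustration, and the code assumes no variable is constant (a constant column would cause a division by zero).

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def redundant_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| at or above threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Hypothetical candidate variables for five accounts.
features = {
    "credit_limit": [1000, 2000, 3000, 4000, 5000],
    "monthly_income": [2100, 3900, 6200, 7900, 10100],
    "months_on_book": [30, 5, 18, 42, 11],
}

flagged = redundant_pairs(features)
for name_a, name_b, r in flagged:
    print(f"{name_a} and {name_b} are highly correlated (r = {r})")
```

Pairwise correlation only catches redundancy between two variables at a time and only linear relationships, so it complements the other methods rather than replacing them.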
EDA: The First Step Toward Creating Highly Predictive Data Models
Flying Phase recently published “Increasing Confidence in Model Performance Monitoring Through Automation,” a case study that explains how we helped a client restore confidence in results produced by its predictive models. Model monitoring, as the case shows, is an operational task that requires checking your machine learning models to make sure they’re performing with the utmost efficiency and accuracy. You might think of it as the equivalent of your annual physical check-up. What makes monitoring so important is that it helps guard against model drift or degradation over time.
It’s this reality, the entropy of model performance, that makes a solid case for why EDA is so critical. It’s one thing for a model’s performance to drift over time. That can happen. But it’s quite another to build a model that is based on weak assumptions and relies on inaccurate data. Why should a predictive model fail from the start? If the success of your business depends on the accuracy of your predictive models, and there’s no doubt that it does, EDA should be as much a part of your approach as model monitoring. It should be step one.