Part Two: A Playbook for Using EDA to Build Predictive Models You Can Trust
If the success of your business depends on the accuracy of your predictive models, and there’s no doubt that it does, exploratory data analysis (EDA) should be as much a part of your approach to model building as model monitoring. It should be step one.
EDA is a critically important investigative process. Using it to analyze information gives you a more nuanced view of data, what it really means, and what you can learn from it. It’s a process that allows a deep dive into data variables to determine not only how they relate to one another but also how they connect with the outcome being modeled. For modeling and analytics to be especially effective, this deep dive also includes understanding the data in the context of the business uses it supports and how it applies to real-life situations.
Part One of our post on EDA, “The Importance of Knowing What You Know,” explains how the process is like detective work, helping you pinpoint clues and insights you can use to identify potential root causes of a problem you’re trying to solve. In this second part of the blog, we provide you with a series of tactics to follow as you apply EDA in building predictive models you can trust. As you complete each step, you begin to understand your data in the context of the problem you’re trying to solve. You also improve the predictive capacity of your models and generate insights that strengthen your business.
Does EDA require a lot of work? It just might. But it’s vitally important to your success. We like to think much effort means much prosperity. The process starts with univariate analysis.
Univariate Analysis: Understand Each Variable
EDA begins with univariate analysis. That means you start the process by looking at one variable from a given data set at a time, irrespective of other variables. If you’re exploring customer credit risk, you might analyze variables such as customer financial data, economic conditions, and an individual’s past payment behavior. This analysis includes looking at data distribution and fill rates over time, possible outliers, and even seasonality.
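As a rough illustration of what a single-variable check can look like, here is a minimal Python sketch that summarizes one numeric variable and flags possible outliers. The 1.5 x IQR fence is one common convention and an assumption here, not a rule from this post, and the card balances are made-up sample values.

```python
import statistics


def profile(values):
    """Summarize one numeric variable and flag values outside the IQR fences."""
    data = sorted(values)
    # quantiles(n=4) returns the three quartile cut points: Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return {
        "count": len(data),
        "min": data[0],
        "median": median,
        "max": data[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Hypothetical card balances for a handful of customers.
card_balances = [120, 135, 150, 160, 170, 180, 200, 5000]
summary = profile(card_balances)
print(summary)
```

A flagged value is only a prompt for investigation. Whether 5000 is an error or a legitimate high balance is a question for the data source and the business.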
As you begin your analysis, it’s important to know the source of the variable, especially if you’re buying data. Data source is something that’s often overlooked or missed due to the assumption that the data has been vetted during the contracting process or based on the reputation of the data provider. However, even if you’re using a trusted source, it’s important to monitor the data for changes over time. This monitoring includes evaluating what is used to create the data or what version of a scoring algorithm was used.
Regarding individual variables, it’s important to understand how they’re defined. You can start by asking several critical questions. For instance, have any changes occurred beyond your control that affect the meaning or values of the variable? Do you expect any in the near future? How would these changes be communicated so that the model can be adjusted over time? If it’s a numerical attribute, do you expect changes to the magnitude of the variable in the short, medium, or long term? How would these changes be monitored and adjusted for in the model?
Since you’re using EDA to build predictive models you can trust, it’s worth the time and effort to avoid certain pitfalls. For instance, when you use purchased data, make sure you understand if there are controls in place to protect the quality of each version and the time frames for phasing out specific variables. In addition, if there is null or missing data, be sure you know why. Also, be sure to analyze whether the fill rate of each variable (the percentage of its values that are populated and valid) makes sense before including it in the model you build. Though clearly important risk predictors, collections and bankruptcies are rare, so you can expect a low fill rate. On the other hand, card balances should have few missing values and therefore a high fill rate.
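To make the fill-rate check concrete, here is a minimal Python sketch. It treats fill rate as the share of values that are populated, and it assumes that None, NaN, and blank strings count as missing; your own data may need a different definition of "invalid." The records and field names are hypothetical.

```python
import math


def is_missing(value):
    """Treat None, NaN, and blank strings as missing (an assumption)."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and value.strip() == "":
        return True
    return False


def fill_rate(values):
    """Share of values that are populated, between 0.0 and 1.0."""
    values = list(values)
    if not values:
        return 0.0
    populated = sum(1 for v in values if not is_missing(v))
    return populated / len(values)


# Hypothetical sample: balances are usually populated, bankruptcies rarely are.
records = [
    {"card_balance": 1200.0, "bankruptcy_date": None},
    {"card_balance": 310.5, "bankruptcy_date": None},
    {"card_balance": None, "bankruptcy_date": "2019-04-01"},
    {"card_balance": 87.0, "bankruptcy_date": None},
]

fill_rates = {
    name: fill_rate(row[name] for row in records)
    for name in records[0]
}
print(fill_rates)
```

Running the same calculation per month, rather than once over the whole data set, is one way to see whether a variable’s fill rate drifts over time.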
Time Series Analysis: Using Historical Data to Inform Judgement
When you talk about modeling, you always talk about observation and performance periods. The model is built using data from the past, which is the observation period. The outcome is built based on the performance measured from the observation period to a defined point that is as close to current as possible. With time series analysis, you research your portfolio of data points over an established observation period. You then compare it to the current data. In doing so, you can make sure the data you use for modeling reflects the characteristics of the current portfolio. This process helps ensure that the meaning and fill rates for each variable haven’t changed significantly over time.
Two methods you can use to make sure your distribution has not changed over time are characteristic stability index (CSI) and population stability index (PSI). CSI is a standard method you can use to compare the modeling data set to the current portfolio. It’s best to employ CSI early in the modeling process, perhaps when you begin to conduct EDA. Doing so saves valuable analysis and modeling time by identifying critical differences in the data over time that would render a model ineffective and require developing a new modeling data set. CSI can also be used after you implement the model. PSI allows you to monitor the model for changes in the attributes and population and can warn you of underlying changes to the population that require further modifications to the model once it is being used in real time. As you do this, be on the lookout for variables that have a substantially different fill rate or distribution over time or that have had changes to their definition over time.
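Both indexes are commonly computed with the same formula: bin the values, then sum (current share - baseline share) x ln(current share / baseline share) across the bins. The Python sketch below assumes that formulation. The bin edges, the small floor that avoids dividing by zero for empty bins, and the score values are all illustrative assumptions.

```python
import math


def stability_index(baseline, current, edges, floor=1e-6):
    """Stability index between two samples, using shared bin edges.

    Applied to a model score this is usually called PSI; applied to a
    single input variable it is usually called CSI.
    """
    def bin_shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # The bin index is the number of edges at or below the value.
            counts[sum(1 for e in edges if v >= e)] += 1
        # Floor each share so an empty bin does not break the logarithm.
        return [max(c / len(values), floor) for c in counts]

    base = bin_shares(baseline)
    curr = bin_shares(current)
    return sum((c - b) * math.log(c / b) for b, c in zip(base, curr))


# Hypothetical scores: the modeling data set versus the current portfolio.
baseline_scores = [610, 640, 655, 670, 690, 700, 720, 735, 750, 780]
current_scores = [560, 580, 600, 615, 630, 650, 660, 690, 710, 760]
edges = [650, 700, 750]

shift = stability_index(baseline_scores, current_scores, edges)
print(round(shift, 3))
```

A frequently cited rule of thumb reads values below roughly 0.1 as stable and values above roughly 0.25 as a significant shift, but the right thresholds for your portfolio are a judgment call.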
Bivariate Analysis: Evaluating Each Variable Against the Predicted Outcome
While univariate analysis focuses on a single variable from a data set, bivariate analysis concentrates on the empirical relationship between each variable and the outcome you’re modeling. You can use a number of statistical methods to accomplish this, including correlation analysis, information value, and the Kolmogorov-Smirnov test.
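Two of these measures can be sketched in a few lines of Python once a variable has been binned and the outcome counted in each bin. The sketch below assumes a binary outcome ("good" versus "bad" accounts), computes the Kolmogorov-Smirnov statistic on the binned data rather than on raw values, and uses made-up counts.

```python
import math


def information_value(bins, floor=1e-6):
    """Information value from a list of (goods, bads) counts per bin."""
    total_good = sum(g for g, _ in bins)
    total_bad = sum(b for _, b in bins)
    iv = 0.0
    for goods, bads in bins:
        # Floor the shares so an empty cell does not break the logarithm.
        pct_good = max(goods / total_good, floor)
        pct_bad = max(bads / total_bad, floor)
        iv += (pct_good - pct_bad) * math.log(pct_good / pct_bad)
    return iv


def ks_statistic(bins):
    """Largest gap between the cumulative good and bad distributions."""
    total_good = sum(g for g, _ in bins)
    total_bad = sum(b for _, b in bins)
    cum_good = cum_bad = 0.0
    largest_gap = 0.0
    for goods, bads in bins:
        cum_good += goods / total_good
        cum_bad += bads / total_bad
        largest_gap = max(largest_gap, abs(cum_good - cum_bad))
    return largest_gap


# Hypothetical counts of (good, bad) accounts, ordered from the
# fewest past late payments to the most.
late_payment_bins = [(400, 10), (300, 30), (200, 60), (100, 100)]

iv = information_value(late_payment_bins)
ks = ks_statistic(late_payment_bins)
print(round(iv, 3), round(ks, 3))
```

Higher values of either measure suggest the variable separates good and bad accounts more strongly. A variable whose bins all have the same mix of goods and bads scores zero on both.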
As with other methods of analysis, there are pitfalls to avoid with bivariate analysis. For models that require regulatory compliance, a linear relationship between the variable and the outcome is expected. For most regulated models, bell-shaped or U-shaped relationships can be too difficult to explain and have to be carefully evaluated and treated.
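One quick way to spot a bell-shaped or U-shaped relationship is to look at the outcome rate across the ordered bins of a variable. The Python sketch below checks only whether the bad rate moves in one direction, which is a looser condition than strict linearity and is our simplification for illustration. The bin counts are hypothetical.

```python
def bad_rates(bins):
    """Bad rate per bin from a list of (goods, bads) counts."""
    return [bads / (goods + bads) for goods, bads in bins]


def trend_shape(rates):
    """Classify the bad rate across ordered bins as monotonic or not."""
    steps = [later - earlier for earlier, later in zip(rates, rates[1:])]
    if all(s >= 0 for s in steps) or all(s <= 0 for s in steps):
        return "monotonic"
    return "non-monotonic"


# Hypothetical (good, bad) counts per ordered bin of two variables.
steadily_rising = [(95, 5), (90, 10), (80, 20)]
u_shaped = [(80, 20), (95, 5), (80, 20)]

print(trend_shape(bad_rates(steadily_rising)))
print(trend_shape(bad_rates(u_shaped)))
```

A non-monotonic result does not rule a variable out. It flags that the relationship needs the careful evaluation and treatment described above before the variable goes into a regulated model.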
Multivariate Analysis: Examining All the Variables at Once
With multivariate analysis, you look at all the variables you’re trying to model at once. As you do, one goal is to understand the relationship between the attributes and, through correlation analysis, determine if some of the variables explain similar behavior and are redundant. This redundancy can cause confusion when building a model, either by making it unnecessarily complicated or by including variables that are not needed to explain or predict the outcome.
As you assess whether a customer is a credit risk for a credit card account, for example, you may look at an array of variables. What credit cards does he have? Does he pay them on time? If he also has a car loan, does he pay that loan on time? What about his rent or mortgage? You can reach a point where some of these variables provide the same information and reflect the same behavior. That’s correlation analysis. Which of these variables provides you with the best signal about your data? At the end of the day, simplicity is preferable because it’s more sustainable and easier to manage. If you are relying on fewer variables to explain core behaviors available in the data, the accuracy of your model is more enduring.
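A minimal Python sketch of that redundancy check follows. It computes the Pearson correlation for every pair of variables and lists the pairs above a cutoff. The 0.8 cutoff, the variable names, and the values are all assumptions for illustration.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def redundant_pairs(columns, threshold=0.8):
    """Pairs of variables whose absolute correlation meets the threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs


# Hypothetical variables for six customers.
columns = {
    "card_late_payments": [0, 1, 2, 3, 4, 5],
    "auto_late_payments": [0, 1, 1, 3, 4, 6],
    "months_at_address": [24, 3, 60, 12, 36, 8],
}

pairs = redundant_pairs(columns)
print(pairs)
```

Here the two late-payment counts move together, so they are candidates for keeping only one. Which one to keep is a question for the business experts as much as for the statistics.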
Machine learning methods such as decision trees or random forests can also prove useful as you perform EDA because they help you understand the features of each variable, the relationship and interaction between variables, and how data is best used for modeling. Some combinations of variables can increase predictability. Finding them during EDA can be particularly useful in heavily regulated areas, where an easier-to-implement modeling technique such as logistic regression is preferred over more complex models that inherently include interactions between variables.
Business Perspective: What Do Subject Matter Experts Think?
Data scientists and subject matter experts working on the problem the model is trying to solve should review the EDA to validate whether the data behaves as expected. By doing so, they can help assure that any modeling or analytical effort will produce an explainable and adoptable model – and make the results intuitive for business decision makers to act upon. Your goal is to make sure data reflects expected trends that will generate a model that has practical value in a business context, one that provides critical insights and allows you to execute actionable strategy.
An agile style of model building, where data scientists and business experts review and discuss results regularly throughout development, ensures a faster and more robust modeling process. Not only does it provide you with new or interesting insights on the portfolio, but it also helps you involve the business experts when selecting between variables that are redundant. You’re also able to assess your intuition about your data and determine whether you might have missed a signal it was sending during the EDA process.
In highly regulated environments, this exchange of information is vital to achieving a defensible and explainable model from a compliance perspective. It puts transparency into the model, and that’s especially important when your business must be able to explain the model to consumers or regulators for compliance purposes.
Get the Most Impact from EDA
How do you ensure the success of your EDA?
The most successful projects are the ones where the person leading the modeling effort understands the business and the customer on the business side understands what the modeling team is after. That partnership is how you have the most success. Everyone on your team – data scientists and business experts alike – works in sync. While that balance is often difficult to achieve, the modeling team should focus on bridging any gaps, taking the time to build the model with input from people who really know the business and have a clear vision of the outcomes it aims to achieve.
When your EDA is done well, there are no surprises. The model behaves just as you would expect from a business perspective, and the outcomes you were predicting align with the results you were seeing during the modeling process. At the end of the day, it comes down to a simple practice. Quality in equals quality out.