Diving Deeper

Responsible Machine Learning: Three Rules of Thumb for World-Class Monitoring

Summary Solution: Highly regulated industries can better leverage machine learning tools (and create trust in their findings) by developing monitoring systems that are holistic, insightful and actionable.

Over the past decade, we’ve seen an explosion in the field of machine learning. Tech giants, online retailers, and social media platforms have all adopted machine learning strategies to evaluate data in ways that human beings and traditional modeling frameworks cannot. These models fit non-linear relationships better, uncover unique insights and are better equipped to pick up on the nuances of a shifting environment.

It’s no surprise, then, that larger, more foundational industries, such as banking and healthcare, are searching for ways to reap the benefits of these tools responsibly. After all, in massive industries where small reductions in model error can equate to millions in savings or meaningful public health gains, machine learning offers model owners a compelling value proposition.

However, along with their benefits, machine learning models present unique risks. Model owners in heavily regulated industries may want to reap the rewards of a superior forecasting tool, but they are often hesitant to take ownership of dynamic machine learning systems because of their potential failure modes. These hesitations may stem from black-box opacity, discomfort with relying on automation over manual processes, or doubt in the system’s ability to self-optimize.

Luckily, these risks can be minimized with robust model monitoring. By being thoughtful about how the health and performance of their ML system is monitored over time, users can harness the full power of machine learning with confidence. While monitoring frameworks come in all shapes and sizes, truly world-class systems are built on three foundational principles: they are holistic, insightful and actionable.

Rule #1: Create a Holistic System

Machine learning systems are, by definition, filled with dependencies. A model is only as good as the data it’s built upon, predictions are only as strong as the algorithms that score them, and insights are only as reliable as the analytical toolset that generates them.

These systems are akin to a delicate ecosystem (compared with legacy single-organism systems), and new data scientists are often surprised to find that, in the real world, machine learning code makes up only a small percentage of the overall structure. It’s therefore imperative that robust monitoring take a holistic view, examining not only the core ML algorithms but also each interconnected piece and how they interact with one another.

As information is funneled through a machine learning system, monitoring guardrails should be established at every step of the way, including pulls from raw data sources, model training and scoring, and even the methods used to analyze your outputs. Automated checks should run within and between modules to ensure the integrity of your results, and data should continue flowing through the pipeline only when those expectations are met.
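To make this concrete, here is a minimal sketch of one such automated handoff check, assuming a pandas-based pipeline; the expectations (required columns, minimum row count, null-rate ceiling) are illustrative placeholders rather than recommended thresholds.

```python
# A minimal sketch of an inter-module guardrail: data only flows downstream
# if the automated checks for that handoff pass. Column names, row counts and
# null-rate limits below are illustrative, not recommendations.
import pandas as pd


def check_handoff(df: pd.DataFrame, expected_cols: list,
                  min_rows: int = 1000, max_null_rate: float = 0.05) -> list:
    """Return a list of failed expectations for this pipeline handoff."""
    failures = []
    missing = set(expected_cols) - set(df.columns)
    if missing:
        failures.append(f"missing columns: {sorted(missing)}")
        return failures  # remaining checks need the expected columns
    if len(df) < min_rows:
        failures.append(f"row count {len(df)} below minimum {min_rows}")
    null_rate = df[expected_cols].isna().mean().max()
    if null_rate > max_null_rate:
        failures.append(f"null rate {null_rate:.1%} exceeds {max_null_rate:.0%}")
    return failures


def gate(df: pd.DataFrame, expected_cols: list) -> pd.DataFrame:
    """Halt the pipeline if the handoff checks fail; otherwise pass data through."""
    failures = check_handoff(df, expected_cols)
    if failures:
        raise ValueError("Handoff blocked: " + "; ".join(failures))
    return df  # expectations met, data continues to flow unimpeded
```

In practice, a gate like this would sit between the ETL output and model training, and again between scoring and the analytics layer, so that a broken upstream feed stops the run rather than silently degrading results.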

Understand the Interconnectivity of Your System

In real-world ML implementations, nothing happens in a vacuum. Upstream process changes, system communication failures and user error are just a few of the risks model owners face on a day-to-day basis. Just as no piece of your ML system executes in isolation, no piece should be monitored in isolation either. While it is common to treat ETL, forecasting and analytics as completely separate modules, their interconnectivity in a robust ML system makes it essential to also monitor the handoffs and relationships between these steps.

Build out monitoring steps that not only track the performance of a specific module, but also trace issues back to their upstream drivers. For example, when monitoring a model build process, a common tactic is to track data drift along with changes in feature importance and population stability indices. While a standard monitoring system may flag data drift between model training and test data, robust systems go a step further and trace that drift back to upstream causes such as ETL processes, data quality (DQ) checks on the latest data, movement of data between technology systems and population sampling strategies. This strategy is important not only for peeling back additional layers of a black-box system, but also for correctly attributing performance changes to their true points of origin.
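As an illustration, here is a small sketch of one such check, a population stability index (PSI) comparing a feature’s training distribution with its latest scoring distribution; the ten-bin setup and the 0.2 alert threshold are common rules of thumb rather than requirements of any particular tool, and the demo data is synthetic.

```python
# Sketch of a population stability index (PSI) check between the training
# population and the latest scoring population for a single numeric feature.
import numpy as np


def psi(expected: np.ndarray, actual: np.ndarray, n_bins: int = 10) -> float:
    """PSI between a baseline sample and a comparison sample of one feature."""
    # Bin edges come from the baseline (training) distribution; clip the new
    # data into that range so out-of-range values land in the outer bins.
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    actual = np.clip(actual, edges[0], edges[-1])
    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # A small floor avoids division by zero and log(0) for empty bins.
    exp_pct = np.clip(exp_pct, 1e-6, None)
    act_pct = np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train_feature = rng.normal(0.0, 1.0, 50_000)    # stand-in training data
    latest_feature = rng.normal(0.3, 1.1, 10_000)   # stand-in scoring data
    score = psi(train_feature, latest_feature)
    if score > 0.2:  # common rule-of-thumb threshold for meaningful drift
        print(f"PSI {score:.3f}: trace back to ETL, DQ checks and sampling")
```

The value of the check comes from what happens after the flag: a drifted feature should kick off a trace through the upstream ETL, DQ and sampling steps rather than an automatic retrain.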

Rule #2: Your System Must Be Insightful

Oftentimes, monitoring systems provide comfort for model developers and validators but not the insight required to inspire confidence from the business. This tends to happen when model developers and business leaders hold differing views on what reliability and success look like for their machine learning system.

For example, model developers may gain comfort by tracking model errors over time and keeping a close eye on potential data drift as a signal that the model needs to be retrained. If monitoring shows the model performing within acceptable limits, the developer will likely be comfortable with the system. For many model owners in the business, however, that may be insufficient, and monitoring focused entirely on aggregate model-level errors and historical performance may miss the mark.

The disconnect lies in the metrics and their inability to answer higher-level questions. While traditional monitoring may give you the “what,” it rarely answers the “so what” or the “why.” It also tends to ignore the dollar impact of model misses, something the business owner is sure to focus on. Effective model monitoring frameworks take the needs of both the developer and the business user into account.

Attributes of Insightful Monitoring Systems

It’s common for model monitoring systems to track historical performance at an aggregate model or sub-model level because that granularity is useful for model training and deployment. But, as we’ve discussed, it is often not meaningful for the business. In addition to traditional views, monitoring should focus on meaningful business segments (line of business, program, product, customer type). This allows the business to gauge reliability at the granularity that matters most to them. A model may perform well in aggregate, but if it fails within a segment that is particularly important to the business, the consequences can be major.
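As a sketch of what segment-level monitoring might look like, assuming a scored pandas DataFrame whose column names (“actual”, “prediction”, “line_of_business”, “product”) are hypothetical placeholders for your own schema:

```python
# Sketch of segment-level monitoring: one error metric, cut by the business
# dimensions that matter. Column names are illustrative placeholders.
import numpy as np
import pandas as pd


def segment_errors(scored: pd.DataFrame, segment_cols: list) -> pd.DataFrame:
    """Mean absolute percentage error and volume per business segment."""
    ape = ((scored["actual"] - scored["prediction"]).abs()
           / scored["actual"].abs().replace(0, np.nan))
    return (scored.assign(ape=ape)
                  .groupby(segment_cols)["ape"]
                  .agg(mape="mean", volume="size")
                  .sort_values("mape", ascending=False))


# A model that looks healthy in aggregate can still be failing a key segment:
# segment_errors(scored, ["line_of_business", "product"]).head()
```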

Traditional monitoring systems also tend to be backward-looking, drawing on historical performance as a proxy for current performance. While this is particularly useful for model developers who are planning rebuild efforts, entirely backward-looking monitoring often misses metrics that matter to the business. Developers should also focus on “live bullets” by incorporating information from the most recent model runs to create a complete system.

Even without actuals to compare against, the newest information can be introduced into your monitoring by tracking month-over-month changes in predictions, the volatility of results and anomalies in outputs. Where hard data is unavailable, incorporate additional business expertise. Ask yourself, “Do movements or shifts in the shape of the model results remain intuitive given internal policies or outside factors such as the competitive or economic environment?” Continuously challenge your results, drawing on business expertise as a second line of defense, even when outputs flow successfully through your monitoring system.
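Below is a sketch of what those forward-looking checks might look like; the entity-by-month layout and the three-sigma threshold are assumptions made for illustration.

```python
# Sketch of forward-looking checks that need no actuals: month-over-month
# movement in predictions, and a simple z-score anomaly flag on the latest
# run relative to each entity's own history. Thresholds are illustrative.
import pandas as pd


def prediction_shift(prev: pd.Series, curr: pd.Series) -> pd.Series:
    """Percent change in predictions between two consecutive runs, by entity."""
    return (curr - prev) / prev.abs()


def flag_anomalies(history: pd.DataFrame, z_thresh: float = 3.0) -> pd.Series:
    """Flag entities whose newest prediction sits far outside their own history.

    `history` is wide: one row per entity, one column per monthly run,
    with the most recent run in the last column.
    """
    past, latest = history.iloc[:, :-1], history.iloc[:, -1]
    z = (latest - past.mean(axis=1)) / past.std(axis=1)
    return z.abs() > z_thresh
```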

While many model monitoring systems focus on generating reports, it’s important to answer the “why” questions to derive higher-level insight. For example, a monitoring system may track the stability of predictions over time. A model developer who sees predictions becoming more volatile may take it as a sign the model needs a rebuild. A great monitoring system digs one step deeper by asking, “Are recent changes in performance truly due to model degradation, or is the model picking up on real environmental shifts?” The ability to separate true signal from genuine model degradation is key to building a robust monitoring platform.

Rule #3: Above All Else, Your Monitoring Should Drive Action

Truly world-class monitoring systems focus on action over reporting. It’s common for routine monitoring reports to be ignored, misinterpreted or altogether forgotten. And by the time reports are compiled, analyzed and presented to senior audiences, a faulty ML system may have been running for days, weeks or even months longer than it should have. Therefore, it is essential that your systems include automated safety valves, in addition to standard reporting, to ensure the reliability and health of the ecosystem.

Machine learning risk can be mitigated by establishing guardrails alongside automated safe modes that are triggered by monitoring results. Safe modes can be based on heuristics (e.g., rolling averages, linear trends, reversion to long-term means), traditional statistics (e.g., linear/logistic regression), a rollback to the last stable version of the model, or a rollback to a simplified version of your model (e.g., one with specific features removed). Safe-mode triggers could include degrading model errors, data drift and unstable feature importance values.
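One way to wire this up is sketched below, under the assumption of a scikit-learn-style model object with a `predict` method and a pandas series of past outputs; the specific triggers, limits and rolling-average fallback are illustrative choices, not prescriptions.

```python
# Sketch of an automated safe mode: if monitoring trips a trigger, serve a
# simple heuristic (a 3-period rolling mean of past outputs) instead of the
# ML score while the root cause is investigated. Limits are illustrative.
import pandas as pd


def monitoring_triggers(recent_error: float, max_psi: float,
                        error_limit: float = 0.15, psi_limit: float = 0.2) -> list:
    """Return the list of tripped triggers for the latest run."""
    tripped = []
    if recent_error > error_limit:
        tripped.append(f"model error {recent_error:.1%} above limit {error_limit:.0%}")
    if max_psi > psi_limit:
        tripped.append(f"data drift detected (max PSI {max_psi:.2f})")
    return tripped


def score_with_safe_mode(features: pd.DataFrame, history: pd.Series,
                         model, tripped: list) -> pd.Series:
    """Use the ML model normally; revert to a heuristic when in safe mode."""
    if tripped:
        fallback = history.rolling(3).mean().iloc[-1]  # latest 3-period average
        return pd.Series(fallback, index=features.index)
    return pd.Series(model.predict(features), index=features.index)
```

The same pattern extends naturally to rolling back to the last stable model artifact or to a reduced-feature variant; the key is that the switch is automatic and logged rather than dependent on someone reading a report.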

The Democratization of Data

In 2020, an estimated 1.7MB of data was created every second for every person on earth, and that number continues to climb. (Thank goodness for the cloud.) In a world that is becoming increasingly data-driven, the benefits of a well-engineered machine learning system are undeniable.

Until now, many non-tech industries have had to watch from the sidelines, unable to unlock the full power of their data due to regulatory scrutiny and discomfort with potential risks of more autonomous systems. But newer, more advanced monitoring techniques are enabling historically conservative industries to adapt to the changing tide, inspiring confidence in the technology from both regulators and end-users.

While following these principles will create a strong foundation for any ML monitoring system, the model development lifecycle is truly becoming a cross-functional effort. Now, more than ever, it is essential that model developers and business experts come together to enable responsible innovation and craft the ideal solution for their use case.

If you need help facilitating that dialogue within your organization, or if you’d like a team of strategists and developers to help you establish that framework, Flying Phase can help. Click here to explore which approach best supports your organization’s needs.