
    Separating Signal from Spin: Understanding How to Evaluate AI Solutions

    How To Properly Evaluate Revenue and Churn Predictions Without Falling for the Hype

    By Bruno Velloso, DataML Engineer & Principal Economist
    [Image: GrowthAI dashboard visualization]

    Sifting Through All the Noise

    You, like me, are probably getting an unending onslaught of emails and articles about too-good-to-be-true AI solutions to a wide variety of business problems, and are likely struggling to separate what is actually useful from what is mostly hype. It's a problem I am acutely aware of, especially as someone who makes a living creating applied AI and machine learning solutions from raw platform usage and telemetry data. I often see claims of near-perfect accuracy in predicting churn, citing high percentages and vague metrics that are, frankly, difficult to understand, even for someone like me who has the technical background to interpret them.

    So I want to write something about how best to sift through the potential solutions you may hear about, most of which are hype and only a select few of which may actually be useful. It's an internal checklist of sorts: what I, and QuadSci, look for in a successful model, and what we do to ensure that the claims we make are credible.

    I am, obviously, biased by the fact that I think the work we do at QuadSci is verifiably valuable and useful. At the least, I can try to explain the steps we take to ensure that the models we train are powerful and useful, and, more importantly, that the results are easily verifiable.

    Make Sure You Understand Exactly What Any Evaluation Metric is Measuring

    The most important thing is to take any number with a big grain of salt. Context matters a lot. If you are hearing vague claims of 99% accuracy, without a clear sense of what that actually means, it's likely too good to be true.

    You need to understand exactly where that accuracy metric comes from, how it was calculated, and what the business context is. It's something we work through extensively with clients as they get onboarded. We take the time to understand, from their perspective, what outcomes they care about, and then we construct KPIs and evaluation metrics that account for that. There is no "one size fits all" solution here. Some important questions to ask are:
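
    To see why a headline accuracy number can mean very little, here is a minimal sketch with made-up numbers: a hypothetical portfolio where only 1% of accounts churn, and a "model" that never predicts churn at all. The labels, counts, and helper functions are illustrative assumptions, not QuadSci's data or code.

```python
def accuracy(y_true, y_pred):
    """Share of all predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Share of true `positive` cases that the model actually flagged."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        return float("nan")  # no positive cases, so recall is undefined
    hits = sum(p == positive for p in preds_for_positives)
    return hits / len(preds_for_positives)


# Hypothetical portfolio: 990 accounts renew, 10 churn (made-up numbers).
y_true = ["renew"] * 990 + ["churn"] * 10
# A "model" that never predicts churn for anyone.
y_pred = ["renew"] * 1000

naive_accuracy = accuracy(y_true, y_pred)
naive_churn_recall = recall(y_true, y_pred, positive="churn")
print(f"accuracy={naive_accuracy:.2%}, churn recall={naive_churn_recall:.2%}")
```

    This do-nothing model scores 99% accuracy while catching none of the churns, which is why the definition behind the number matters more than the number.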

    • What does "accuracy" actually mean? (There are A LOT of definitions in machine learning.)
    • What data are you using to test the model? (It should be a never-touched sample, completely separate from the data used for training, both in time and across users.)
    • What exact outcome is the model targeting? (Often the targets the model is trained on are misleading or unclear.)
    • How can I verify this claim, or that this model is useful?
    • What does "good performance" mean in this business context? (Some businesses are inherently much less predictable than others.)
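
    The second question, a test sample that is separate both in time and across users, can be sketched roughly as follows. This is one possible way to build such a split, not a description of QuadSci's pipeline; the function name, the `account_id` and `as_of` keys, and the example dates are all hypothetical.

```python
from datetime import date


def out_of_sample_split(rows, cutoff, holdout_accounts):
    """Split rows so the test set is separate in time AND across accounts.

    rows: iterable of dicts with hypothetical keys "account_id" and "as_of".
    cutoff: rows dated on or after this date are never used for training.
    holdout_accounts: account ids the model must never see during training.
    """
    train, test = [], []
    for row in rows:
        is_late = row["as_of"] >= cutoff
        is_held_out = row["account_id"] in holdout_accounts
        if is_late and is_held_out:
            test.append(row)   # unseen account in an unseen period
        elif not is_late and not is_held_out:
            train.append(row)  # training account in the earlier period
        # Rows that are only late or only held out are dropped, so the
        # test set shares neither accounts nor time periods with training.
    return train, test


rows = [
    {"account_id": "A", "as_of": date(2023, 1, 31)},
    {"account_id": "A", "as_of": date(2024, 1, 31)},
    {"account_id": "B", "as_of": date(2023, 1, 31)},
    {"account_id": "B", "as_of": date(2024, 1, 31)},
]
train, test = out_of_sample_split(
    rows, cutoff=date(2023, 7, 1), holdout_accounts={"B"}
)
```

    Dropping the mixed rows costs some data, but it means a good test score cannot come from the model having already seen that account or that period.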

    At QuadSci, we provide a variety of metrics and clearly define them in advance of training. For our multi-class GrowthAI model, for example, which tries to predict 5 distinct outcomes (from high growth to contraction to churn), the metric that seems to be most commonly accepted as a measure of "accuracy" is recall: of the events that actually happened, the share the model flagged in advance. In the context of predicting whether a client will churn, renew or grow, these are the questions we typically ask:

    "Here are some surprise churns that happened recently. In what share of those accounts did QuadSci's model accurately convey risk 6-12mo in advance?" Or, "Of the accounts that grew the fastest, how many of them were viewed as high-growth candidates by QuadSci's models 6-12mo in advance?"

    Our goal is to look at the share of churn (or growth) events that we correctly identified 6-12mo ahead of the fact, on data never before seen by our model (averaged across all our clients). Some businesses are inherently much less predictable, with contracts that vary in term and timing, while businesses built on annual contracts are more predictable. The value of the resulting recall (the "accuracy" numbers you may see in the market) depends on the additional value you can gain from the model relative to what you already know.

    I strongly caution against focusing on any one metric. We made a very intentional choice, for example, to make our model more useful by predicting a range of outcomes rather than JUST predicting churn. This gives you a much more complete picture of the likelihood and causes of different possible growth trajectories for each account (from rapid growth to flat renewal to partial/full churn), but it makes any single binary classification metric appear artificially lower.

    In the end, we decided that it's more useful to have a model tell you an account has both high potential for growth and high risk of churn, even if that seems contradictory at first glance (newer customers tend to churn at a high rate but also grow at a high rate, for example), than to be tethered to a single, binary metric.
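
    As a rough illustration of recall measured per outcome, here is a small sketch. The five outcome labels and the toy data are assumptions made up for the example (the real class names are not given here), and the predictions are meant to stand for calls made in advance on held-out accounts.

```python
from collections import Counter

# Hypothetical labels for the five outcomes; the actual names may differ.
OUTCOMES = ["high_growth", "growth", "flat_renewal", "contraction", "churn"]


def recall_by_outcome(y_true, y_pred, outcomes=OUTCOMES):
    """For each outcome, the share of accounts that actually ended up
    there which the model had predicted correctly in advance."""
    actual = Counter(y_true)
    caught = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {
        o: (caught[o] / actual[o] if actual[o] else None)  # None: no cases
        for o in outcomes
    }


# Toy held-out accounts: what happened vs. what was predicted beforehand.
y_true = ["churn", "churn", "churn", "churn",
          "high_growth", "high_growth", "flat_renewal", "contraction"]
y_pred = ["churn", "churn", "churn", "contraction",
          "high_growth", "growth", "flat_renewal", "contraction"]

scores = recall_by_outcome(y_true, y_pred)
print(scores)
```

    In this toy data the model caught 3 of the 4 churns (recall 0.75) and 1 of the 2 high-growth accounts (recall 0.5). Note that the missed churn was still predicted as a contraction, a near miss that a single binary score would hide.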
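
    One way to picture an account that is both a growth candidate and a churn risk is to read a multi-outcome prediction as a probability spread rather than a single label. The sketch below is purely illustrative: the outcome names, the probabilities, and the 0.3 threshold are all hypothetical, not QuadSci's actual outputs.

```python
def account_flags(probs, threshold=0.3):
    """Summarize a five-outcome probability spread into two independent
    flags, instead of forcing one binary churn / no-churn answer."""
    upside = probs["high_growth"] + probs["growth"]
    downside = probs["contraction"] + probs["churn"]
    return {
        "growth_potential": upside >= threshold,
        "downside_risk": downside >= threshold,
    }


# Hypothetical newer customer: likely to move sharply in either direction.
new_account = {
    "high_growth": 0.35,
    "growth": 0.10,
    "flat_renewal": 0.10,
    "contraction": 0.10,
    "churn": 0.35,
}
flags = account_flags(new_account)
print(flags)
```

    Both flags come back true for this account, which a single churn score would flatten into one ambiguous number.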

    It's Much Easier to Evaluate the Model's Insightfulness and Usefulness than its Predictiveness

    I have a strong bias for models that are explainable and transparent. Neural networks can be incredibly predictive and useful, but they can often over-memorize training data (and thus degrade quickly over time and out of sample), and you can't easily verify WHY they are predicting that something will happen.

    Understanding why a prediction is made is critical: if you do not know why an account will churn, there's not much you can do about it.

    To that end, we painstakingly convert billions of telemetry data points into understandable signals via our proprietary data processors, and we neatly summarize the most important (understandable) signals that are driving our model's predictions. This tends to be where people lean in.

    From that data comes a mix of insights: (1) things our customer already knows from years of business experience, but is impressed the model picks up on; (2) things that are new to the client but make logical sense to them; and (3) surprising insights they would never have thought about, but that make sense upon further examination.

    Once someone is comfortable that our model is accurately diagnosing usage behavior within their products, and that it correctly understands both the sophisticated parts of the platform that indicate stickier usage and the types of activity that indicate risk, the conversation turns from evidence to action.

    It's hard to refute a model that is picking up on important usage patterns and correlations, especially when there is nuance. For example, a spike in admin activity can be good or bad depending on the broader usage profile, and our models do an excellent job of highlighting when it is a cause for concern and when it is just an example of sophisticated usage.

    So, as a numbers guy, I'm basically telling you to ignore the numbers. Look to areas of your business or platform you understand well, and see whether a model is assessing that behavior correctly. This is the only way you can know if the underlying model is picking up on the correct patterns and not just spurious correlations or confirming pre-existing conditions.

    Ultimately, Just Test It in the Field

    We've found that people don't fully buy in until they can "feel" the product. So after a trial, once we've trained our model and finished our dashboard, we tend to do something we call "dealer's choice."

    By the time we get to this step, there is typically a month or two of recent data that we were never given access to. We simply ask our client, on the spot and with no way for us to prepare in advance, whether there are any accounts they know well that they would like to profile in our dashboard. Often, they pick a recent churn that took them by surprise. Nine out of ten times, we correctly identify the risk in the account, up to a year before the event.

    If you look through enough accounts, I think the level of detail and predictiveness QuadSci can give you about an account becomes clear: from predicting the overall health of a customer, to specifying its biggest strengths and weaknesses, to identifying potential solutions based on training our Q-Chat agent on your product documentation and our model outputs.

    Conclusion

    If you come across an applied AI or ML solution, make sure you understand the evaluation metric well, and that you can verify and evaluate the model's insights and its predictions directly in the field. Otherwise, give it a pass.