Is AI testing like software testing?


There’s much similarity between testing applied AI/ML applications (for brevity, I’ll just refer to these as “AI”) and testing traditional software[1]. Regardless of the vertical, the type of model, and the architecture, these similarities have worked in the industry’s favor: traditional software CI/CD processes and technologies have readily embraced AI applications, and a plethora of tools is available today. But despite these similarities, the differences are profound, and sometimes subtle, because of AI’s much greater dependence upon data.

 

What does it mean to train an AI model?

In practice, “training” an AI model includes a great deal of testing by its very nature. Training and a certain level of testing can’t be separated. In fact, the phrase “training data” often implies both training and test data, because the data selected for developing the model will be divided into training data and “holdout” data used for testing. The holdout data is never seen by the model during training. This is important. And there may be multiple holdouts for various test activities.
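As a minimal sketch of that split, assuming a scikit-learn-style workflow and a stand-in dataset (the real data and split sizes would of course be project-specific), carving out holdout data might look like this:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in for a real labeled dataset (purely illustrative).
X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

# Carve out the holdout once, up front; the model never sees it during training.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# A second holdout (here a validation split) supports intermediate test activities.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=0
)
```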

 

Think of the training/test relationship this way. Imagine a boxer preparing for the next big fight. He does roadwork, hits the speed bag and heavy bag, and jumps rope (trains) to prepare. He trains like this for a while and then spars with a partner, and his trainer assesses (tests) how well the sparring session went. If it went poorly, the trainer adjusts the training regimen (“More roadwork. You were tired out by the second round!”) and they do it all again. Eventually, the trainer is pleased with the sparring results and deems the boxer ready.

While this isn’t quite the way boxers train, it is how AI models are trained, though it is all automated and may execute hundreds of thousands of training/testing iterations. But unlike the boxer, who can work with any old heavy bag, an AI model is much more sensitive to the quantity and quality of the data used in its training and testing. Incorrectly sampled real-world data or improper separation of holdouts means a bad model, as the examples below show.
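To make the analogy concrete, here is a simplified sketch of that automated train/evaluate/adjust loop, assuming a scikit-learn-style classifier and a synthetic dataset; a real pipeline would sweep far more settings and many more iterations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

best_model, best_score = None, 0.0
# "Adjust the training regimen": sweep a hyperparameter, spar against the
# validation holdout, and keep whichever setting performs best.
for n_trees in (10, 50, 100, 200):
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    model.fit(X_train, y_train)                           # training (roadwork, heavy bag...)
    score = accuracy_score(y_val, model.predict(X_val))   # the sparring assessment
    if score > best_score:
        best_model, best_score = model, score

print(f"Best validation accuracy: {best_score:.3f}")
```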

 

Why is data so important?

AI models depend upon, and are sensitive to, data much more so than traditional, prescriptive software solutions. Say we’re going to build a cat detector that can classify a picture as containing an image of a domestic cat or not. We have to consider many subclasses of data: different breeds, colorations, poses, distances, and image characteristics such as lighting and orientation. And our classifier must reject things that look like cats but aren’t, so we’ll need a bunch of not-cat images. We need to consider how many of each of these subclasses, and combinations of them, are needed to train the model to recognize cats in their myriad forms and reject things that are not cats.

 

The same demands for quality data exist for non-image AI applications as well. To train a model to detect problems with medical forms, we need sample forms that are complete and correct, plus samples with misspellings, incomplete sections, missing sections, etc. Model performance is defined by the degree to which the training/test data represent the real world. Treat the data as a first-class citizen, with proper planning, collection, and configuration control, and expect the data set to change over time as test results dictate that additional, specific subclasses of data are needed. “We need more sitting cats!”
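As a rough illustration of treating the data as a first-class citizen, a subclass coverage check over dataset metadata might look like the sketch below. The metadata fields (breed, pose, lighting) are hypothetical and stand in for whatever a labeling tool or dataset manifest actually records:

```python
from collections import Counter

# Hypothetical metadata for a cat / not-cat image dataset.
samples = [
    {"label": "cat", "breed": "siamese", "pose": "sitting", "lighting": "indoor"},
    {"label": "cat", "breed": "tabby",   "pose": "walking", "lighting": "outdoor"},
    {"label": "not_cat", "breed": None,  "pose": None,      "lighting": "outdoor"},
    # ... thousands more records in practice
]

# Count how well each subclass is represented so that under-covered cases
# ("We need more sitting cats!") can be identified early.
by_label = Counter(s["label"] for s in samples)
by_pose  = Counter(s["pose"] for s in samples if s["label"] == "cat")

print(by_label)   # e.g. Counter({'cat': 2, 'not_cat': 1})
print(by_pose)    # e.g. Counter({'sitting': 1, 'walking': 1})
```

The same counts can feed planning for the next round of data collection when a subclass, or a combination of subclasses, turns out to be under-represented.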

 

Why are my engineers biased?

When an AI engineer trains and tests a model, they make hundreds of decisions that affect model performance. They can inadvertently make decisions that introduce bias, particularly when trying to squeeze that last bit of accuracy out of a model.

 

At SphereOI, we use multiple levels of holdouts, independent testing, and restricted feedback to eliminate this risk. Once an AI engineer has trained a model, a team lead tests with a second set of holdout data that neither the model nor the AI engineer has seen. Importantly, the results of this test are communicated to the engineer as only a pass or a fail. They are denied access to the detailed performance results. This prevents the engineer from analyzing model performance on the unseen holdout data and adjusting training/testing regimens in some (inadvertently) biased way.
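A minimal sketch of that restricted-feedback gate is below. The function name, the metric, and the accuracy threshold are hypothetical, illustrating the idea rather than SphereOI’s actual tooling:

```python
from sklearn.metrics import accuracy_score

REQUIRED_ACCURACY = 0.90  # hypothetical acceptance threshold

def independent_check(model, X_second_holdout, y_second_holdout) -> str:
    """Team lead evaluates on a second holdout the engineer has never seen."""
    score = accuracy_score(y_second_holdout, model.predict(X_second_holdout))
    # Detailed metrics stay with the team lead; only the verdict goes back
    # to the engineer, so they can't tune against the unseen data.
    return "PASS" if score >= REQUIRED_ACCURACY else "FAIL"
```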

 

When am I done testing?

Data changes over time. Introducing new products means the recommendation engine must be re-trained. The more insidious case is when subclasses within real-world data change over time. For example, our cat classifier may have met requirements when initially deployed, but a new breed with a heretofore unseen coloration may degrade model performance. Unlike software regression test data, which can remain static once created, AI systems should be evaluated against current, real-world data on a regular basis. To some extent, testing never ends.

 

We handle this in a couple of ways at SphereOI. Sometimes we incorporate into our CI/CD pipelines the periodic augmentation of our regression test data with current real-world data. In other cases, we develop real-time evaluators of system performance and deploy them with the production system as a form of built-in test.
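As an illustration of the built-in-test idea, here is a minimal sketch of a rolling performance monitor that could ship alongside a production model. The window size, alert threshold, and class name are hypothetical, and the sketch assumes ground truth eventually becomes available for recent predictions:

```python
from collections import deque

class RollingAccuracyMonitor:
    """Tracks accuracy over the most recent predictions as a built-in test."""

    def __init__(self, window: int = 1_000, alert_below: float = 0.85):
        self.outcomes = deque(maxlen=window)
        self.alert_below = alert_below

    def record(self, prediction, ground_truth) -> None:
        self.outcomes.append(prediction == ground_truth)

    def healthy(self) -> bool:
        if len(self.outcomes) < self.outcomes.maxlen:
            return True  # not enough recent data yet to judge
        return sum(self.outcomes) / len(self.outcomes) >= self.alert_below
```

When `healthy()` flips to False, the pipeline can raise an alert or trigger retraining against fresher data.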

 

How do I perform independent testing?

“Independent” testing and verification activities need to focus on the “independence” of the data AND on ensuring the data is representative of the real world. If the data is not independent (a true holdout), then there is a risk of not detecting poorly performing models, because the allegedly independent testing will have been tainted by using data the model saw during development. This is known as leakage. If the data is not representative of the real world, then the independent testing will fail to surface cases where the model underperforms when it encounters subclasses outside of those upon which it was developed. Just like developing a model, independent testing/verification of a model needs to treat the data as a first-class citizen and thoroughly assess how the model performs against independent (of development), real-world data.
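One simple way to check for leakage is to fingerprint both data sets and look for overlap. This is only a sketch: it assumes records can be compared byte-for-byte, so near-duplicates (resized images, re-encoded files) would need fuzzier matching:

```python
import hashlib

def fingerprint(record: bytes) -> str:
    """Stable content hash used to compare records across data sets."""
    return hashlib.sha256(record).hexdigest()

def leakage(train_records, independent_records) -> set[str]:
    """Return fingerprints that appear in both sets, i.e., leaked records."""
    train_hashes = {fingerprint(r) for r in train_records}
    return {fingerprint(r) for r in independent_records} & train_hashes

# Any non-empty result means the "independent" data is not truly independent.
```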

 


[1] To be clear, I’m writing about applied AI engineering – using AI techniques and tools to solve real-world, industry-specific problems. I’m not addressing the kind of work we often read about from the tech giants developing the next world-champion Go player.

I’m Erik Stein, a partner at SphereOI. I’ve worked in defense, intelligence, insurance, IoT, precision agriculture and ran a game company for a while. Today I help clients apply AI to solve their mission-critical problems.