Will I benefit from an independent model assessment?

The team has built an AI model and its performance appears adequate. But there is a lot riding on the deployment. Do I need an independent model test and evaluation? When is the cost justified?

Many AI engineering teams, especially newer ones, fail to realize the many ways in which bias and performance limits can be designed or trained into an AI model. In the push to develop high-performing models, teams sometimes lose a degree of objectivity and inadvertently bias or overfit their models, causing significant degradation in model accuracy after deployment. This puts at risk not only the investment in the model but also the system(s) that depend on the model's outputs.

When the risk of introducing bias, insufficient model accuracy, or data drift is consequential, an independent test and evaluation is warranted. What counts as consequential differs from organization to organization, but it generally means the risk falls outside the organization's appetite. An independent test and evaluation provides due diligence to mitigate the risk of AI/ML models negatively impacting mission or business goals.

Bias occurs when an AI/ML model has been trained such that its predictions, classifications, or labels assign outcomes to subclasses of the data in proportions that are not reflected in the production data. Model accuracy is the ability of the model to perform its task correctly enough to meet mission or business needs. Data drift, also referred to as model drift, occurs when aspects of the data change in ways the model never learned, and model performance degrades as a result.
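As a rough illustration of the bias definition above, the sketch below compares the rate of positive predictions a model assigns to each subclass against the rate actually observed in production data. It assumes pandas DataFrames with hypothetical column names (subgroup, prediction, label); none of these come from any specific system.

```python
import pandas as pd

def subgroup_positive_rates(df: pd.DataFrame,
                            subgroup_col: str = "subgroup",
                            outcome_col: str = "prediction") -> pd.Series:
    """Rate of positive outcomes per subgroup. Column names are illustrative."""
    return df.groupby(subgroup_col)[outcome_col].mean()

def compare_to_production(model_outputs: pd.DataFrame,
                          production_data: pd.DataFrame) -> pd.DataFrame:
    """Side-by-side view of the model's positive-prediction rate vs. the rate
    observed in production data, per subgroup. Large gaps suggest the model
    assigns outcomes in proportions the production data does not reflect."""
    predicted = subgroup_positive_rates(model_outputs, outcome_col="prediction")
    observed = subgroup_positive_rates(production_data, outcome_col="label")
    return pd.DataFrame({"predicted_rate": predicted, "observed_rate": observed})
```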

Some AI/ML systems carry very little risk in these areas. Take a you-may-also-like product recommender, which simply finds other products whose descriptions are related to the products at the top of the user's search results. The model identifies similar or related products based solely on the product descriptions themselves, so there is no opportunity to produce results biased by demographic information about the user. Model accuracy needs to be good enough to provide useful results but can tolerate some inaccuracy, both in failing to retrieve every similar product and in selecting products that are only marginally related. The model is updated as product offerings change but needs no other maintenance to handle changes in data over time.
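A minimal sketch of such a description-only recommender, assuming scikit-learn is available: it ranks products by TF-IDF cosine similarity of their descriptions, so no user demographic information enters the computation. The product catalog here is invented purely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical product descriptions; a real catalog would come from a database.
descriptions = [
    "stainless steel chef knife, 8 inch blade",
    "carbon steel paring knife, 3 inch blade",
    "bamboo cutting board, large",
    "electric kettle, 1.7 liter, stainless steel",
]

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(descriptions)

def similar_products(index: int, top_n: int = 2) -> list[int]:
    """Indices of products whose descriptions are most similar to the product
    at `index`, excluding the product itself."""
    scores = cosine_similarity(doc_matrix[index], doc_matrix).ravel()
    ranked = scores.argsort()[::-1]
    return [i for i in ranked if i != index][:top_n]

print(similar_products(0))  # the other knife should rank highest
```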

But consider a resume classifier that, given a job description, searches the web for resume authors and classifies resumes as good or bad candidates for the job based on both the resume and information gleaned from the web. This use of personal information creates many opportunities to bias classifier results. Imagine a term or phrase that has become popular only in the last five years, and the author of the job description uses it. The resume classifier might prefer candidates whose resumes contain that term or phrase and unintentionally generate age-discriminatory results.
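One hedged way to check for this kind of effect after the fact is a simple selection-rate audit by age bracket. The data and column names below are hypothetical; the point is only that a large gap in selection rates between brackets flags results worth investigating.

```python
import pandas as pd

# Hypothetical audit data: one row per scored resume, with the classifier's
# decision and an age bracket gathered for audit purposes only.
results = pd.DataFrame({
    "age_bracket": ["<40", "<40", "<40", "40+", "40+", "40+"],
    "classified_good": [1, 1, 0, 1, 0, 0],
})

selection_rates = results.groupby("age_bracket")["classified_good"].mean()
impact_ratio = selection_rates.min() / selection_rates.max()

print(selection_rates)
# Ratios well below 1.0 suggest one bracket is selected far less often.
print(f"Selection-rate ratio across brackets: {impact_ratio:.2f}")
```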

While the simple product recommender can tolerate some inaccuracy, a system to identify targets on the battlefield cannot. A classifier to determine threat/non-threat or target/not-target must 1) identify everything that is a threat and 2) not identify non-threats. Bias is not a concern in this case, but exceptional, generalized model accuracy is paramount.
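Those two requirements correspond to recall (catch every threat) and precision (raise no false alarms). A small sketch using scikit-learn metrics, with invented evaluation labels:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical held-out evaluation labels: 1 = threat, 0 = non-threat.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = recall_score(y_true, y_pred)        # requirement 1: identify every threat (no misses)
precision = precision_score(y_true, y_pred)  # requirement 2: flag nothing that is not a threat

print(f"missed threats (false negatives): {fn}")
print(f"false alarms (false positives):   {fp}")
print(f"recall={recall:.2f}, precision={precision:.2f}")
```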

Even when sufficient model accuracy has been achieved and the model has been deployed to production, there is a chance the data will change over time. This is referred to as drift. As data drift, they may no longer have the statistical properties they had when the model was trained, and model accuracy degrades. A recent, expensive example is the model Zillow built to identify properties for purchase: rapidly changing conditions in the housing market were never learned by the model, and its predictions overestimated property values.
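A common, lightweight way to watch for drift is to compare the distribution of a feature at training time against what arrives in production, for example with a two-sample Kolmogorov-Smirnov test. The sketch below uses synthetic numbers purely for illustration; it is not Zillow's data or method.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical feature values: what the model saw at training time vs. what
# arrives in production after conditions have shifted.
training_feature = rng.normal(loc=300_000, scale=50_000, size=5_000)
production_feature = rng.normal(loc=360_000, scale=80_000, size=5_000)

# Two-sample Kolmogorov-Smirnov test: a small p-value indicates the production
# distribution no longer matches the distribution the model was trained on.
statistic, p_value = ks_2samp(training_feature, production_feature)
if p_value < 0.01:
    print(f"Drift detected (KS statistic {statistic:.3f}); consider retraining.")
```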

An AI application may be at risk from any single factor or combination of factors. The risks come from multiple sources: choices of model architecture, training data, training and validation regimens, and lack of monitoring or immature MLOps practices.

Our AI engineering practice at SphereOI has matured over years of developing many types of models across industries and domains. Not only do we develop AI solutions for clients, we also provide independent test and evaluation assessments that mitigate the risks originating from these sources.