Machine Learning Model Validation
Machine learning (ML) algorithms have become pervasive in many technological fields, where they both surpass some traditional techniques and offer breakthrough solutions to previously intractable problems. A key element of ML is that models learn from training data rather than being explicitly designed and tuned by engineers. This same element poses significant challenges for verifying and validating a trained model with regard to its effectiveness and safety.
On the one hand, the models have generic structures that do not represent physical processes or laws. On the other hand, the data used to train and validate them relies on human effort to curate the inputs, often manually and informally, and to define the expected outputs (in supervised learning). Together, these factors render many classical V&V methods unsuitable.
The usual supervised training process relies on large amounts of data comprising system inputs and their respective outputs, and optimizes the model parameters to fit that data by minimizing a “loss” metric. However, fitting the model to the training data does not establish that it will perform as desired outside the training domain. For that purpose, “testing data” that is never used in training is used to assess how well the model generalizes to expected scenarios. In addition, “validation data” can be used to guide the high-level design of the model architecture. Hence, three main categories of datasets are used [MLV1, MLV2]:
- Training dataset: the data used to train the ML model, i.e., to optimize its parameters;
- Validation dataset: the sample of data used to evaluate the trained model and to guide the design of the ML model structure and training setup (e.g., number of neural network layers, training procedures, etc.), the so-called hyperparameters. This dataset also introduces bias, as the model design is tailored to favour it;
  - Note: the term “validation” as used here is distinct from the term as used in the context of V&V
- Test dataset: the sample of data used to estimate the real, unbiased performance of the model when applied to novel input.
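The three dataset roles listed above can be illustrated with a minimal sketch of a random three-way split. The function name `three_way_split`, the 70/15/15 proportions, and the toy data are assumptions made for illustration only; they are not prescribed by [MLV1, MLV2], and real projects often need stratified or time-aware splits instead.

```python
import random


def three_way_split(samples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle the samples and partition them into train/validation/test lists.

    Hypothetical helper: the fractions and the seed are illustrative defaults.
    """
    if not 0 <= val_fraction + test_fraction < 1:
        raise ValueError("val_fraction + test_fraction must be in [0, 1)")
    shuffled = list(samples)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]                # held back until the final estimate
    val = shuffled[n_test:n_test + n_val]   # used to choose hyperparameters
    train = shuffled[n_test + n_val:]       # used to fit the model parameters
    return train, val, test


# Toy (input, expected output) pairs standing in for a labelled dataset.
data = [(x, 2 * x) for x in range(100)]
train, val, test = three_way_split(data)
print(len(train), len(val), len(test))
```

The important property is that the three subsets are disjoint: the test subset in particular must not influence either the parameter fitting or the hyperparameter choices.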
Model validation based on multiple datasets can be carried out with several variants of the technique, including holdout, cross-validation, random subsampling, and bootstrapping, which differ in how the available data is partitioned or resampled into training and evaluation sets.
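As a rough illustration of two of these variants, the sketch below generates index sets for k-fold cross-validation and for bootstrapping with out-of-bag evaluation. The helper names `k_fold_indices` and `bootstrap_indices` are hypothetical and use only the Python standard library; in practice, an established library implementation would normally be preferred.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, held_out_idx) pairs for k-fold cross-validation.

    Each sample index is held out in exactly one of the k folds.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding by k gives k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, held_out


def bootstrap_indices(n_samples, n_rounds, seed=0):
    """Yield (train_idx, oob_idx) pairs for bootstrap validation.

    train_idx is drawn with replacement; oob_idx ("out-of-bag") holds the
    indices that were never drawn and can be used for evaluation.
    """
    rng = random.Random(seed)
    for _ in range(n_rounds):
        train_idx = [rng.randrange(n_samples) for _ in range(n_samples)]
        drawn = set(train_idx)
        oob_idx = [j for j in range(n_samples) if j not in drawn]
        yield train_idx, oob_idx


folds = list(k_fold_indices(10, k=5))
rounds = list(bootstrap_indices(10, n_rounds=3))
```

In both cases a model would be trained on the first index list of each pair and evaluated on the second, with the per-round scores averaged to obtain the overall estimate.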
Because ML is a novel technology, the regulatory requirements for the V&V of ML systems are still being drafted by standards organizations and regulatory authorities, for example, in the healthcare domain [MLV3, MLV4].
Strengths
- Does not require interpretability of the models, i.e., it allows black-box models (although it can be used with interpretable models too)
- The data and the procedures used to create the model are of the same nature as the data and procedures used to validate it
- Enables assimilation of validation data into the model in order to further enhance it, which is especially important when input data is found on which the model fails
Limitations
- Usually requires manual effort to define the expected output (especially in supervised machine learning)
- Does not offer formal guarantees of effectiveness or safety
- Cannot evaluate the model behaviour on challenging input that does not yet exist in the available data
References
- [MLV1] Buduma, N., and N. Locascio. Fundamentals of Deep Learning: Designing Next-Generation Machine Intelligence Algorithms. O'Reilly Media, 2017.
- [MLV2] Goodfellow, I., Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
- [MLV3] Proposed Regulatory Framework for Modifications to Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device (SaMD). FDA, 2019.
- [MLV4] Regulatory Guidelines for Software Medical Devices – A Life Cycle Approach. Health Sciences Authority, Government of Singapore, 2020 (Section 8, Artificial Intelligence Medical Devices (AI-MD)).