Machine Learning Model Validation

A method to assess machine learning model behaviour on data different from the training data.
Model validation in machine learning serves to evaluate how well a system performs, and how safe it is, when applied to input data other than the data used to train it.

Machine learning (ML) algorithms have become pervasive in many technological fields: they surpass some traditional techniques and offer breakthrough solutions to previously intractable problems. A key characteristic of ML is that models learn from training data rather than being explicitly designed and tuned by engineers. This same characteristic poses significant challenges in verifying and validating a trained model with regard to its effectiveness and safety.

On the one hand, models have generic structures that represent no physical processes or laws; on the other hand, the data used to train and validate the models must be curated manually and informally, and the expected output must be defined by human effort (in supervised learning). Both factors render many classical V&V methods unsuitable.

The usual supervised training process relies on large amounts of data comprising system inputs and their corresponding outputs, and optimizes the model parameters to fit that data by minimizing a “loss” metric. However, fitting the model to training data does not establish that the model will perform as desired outside that training domain. For that purpose, “test data” that is never used to train the model is used to validate the model’s generalization to expected scenarios. In addition, “validation data” can be used to guide the high-level design of the model architecture. Hence, three main categories of datasets are used [MLV1, MLV2]:

  • Training dataset: data used to train and optimize the ML model parameters;
  • Validation dataset: data used to evaluate the trained model and guide the design of the ML model structure (e.g., number of neural network layers, training procedures, etc.), the so-called hyperparameters. Validation data also introduces bias into the model, as the model design is tailored to favour it. Note: the term “validation” here is distinct from its use in the context of V&V;
  • Test dataset: data used to estimate the real, unbiased performance of the model when applied to novel input.
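As an illustration, the three-way split into training, validation, and test data can be sketched in plain Python. The function name and split fractions below are hypothetical choices for illustration, not part of any standard:

```python
import random

def three_way_split(data, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle a dataset and partition it into training, validation, and test sets.

    The three sets are disjoint and together cover the whole dataset, so the
    test set never contributes to parameter fitting or hyperparameter choice.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    items = list(data)
    rng.shuffle(items)
    n = len(items)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

samples = list(range(100))  # stand-in for labelled (input, output) pairs
train, val, test = three_way_split(samples)
```

With the default fractions, 60% of the samples go to training, 20% to validation, and the remaining 20% to testing.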

Model validation based on multiple datasets can then be implemented through several variants of the technique, including holdout, cross-validation, random subsampling, and bootstrapping.
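Cross-validation, one of the variants listed above, can be sketched as follows. The helper names (`k_fold_indices`, `cross_validate`) and the trivial mean-predictor “model” are illustrative assumptions, not a reference implementation:

```python
import statistics

def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs partitioning n samples into k folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size

def cross_validate(xs, ys, fit, score, k=5):
    """Fit on k-1 folds, score on the held-out fold, and average the scores."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in test_idx],
                            [ys[i] for i in test_idx]))
    return statistics.mean(scores)

# Hypothetical usage: a trivial "model" that predicts the training-target mean.
xs = list(range(10))
ys = [2.0 * x for x in xs]

def fit_mean(train_x, train_y):
    mu = sum(train_y) / len(train_y)
    return lambda x: mu  # constant predictor, for illustration only

def mse(model, test_x, test_y):
    return sum((model(x) - y) ** 2 for x, y in zip(test_x, test_y)) / len(test_y)

cv_error = cross_validate(xs, ys, fit_mean, mse, k=5)
```

Every sample is used for testing exactly once, so the averaged score is a less variable estimate of generalization than a single holdout split, at the cost of k training runs.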

Because ML is a novel technology, regulatory requirements for the V&V of ML systems are still being drafted by standards organizations and regulatory authorities, for example in the healthcare domain [MLV3, MLV4].

Advantages
  • Does not require interpretability of the models, i.e., it allows black-box models (although it can also be used with interpretable models);
  • The data and procedures used to create the model are of the same nature as the data and procedures used to validate it;
  • Enables assimilation of validation data into the model to further enhance it, which is especially important when input data on which the model fails is found.

Limitations
  • Usually requires manual effort to define the expected output (especially in supervised machine learning);
  • Does not offer formal guarantees of effectiveness or safety;
  • Cannot evaluate the model’s behaviour on challenging input that does not yet exist.
References
  • [MLV1] Buduma, N., and N. Locascio. Fundamentals of Deep Learning: Designing Next-Generation Machine Intelligence Algorithms. O'Reilly Media, 2017.
  • [MLV2] Goodfellow, I., Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
  • [MLV3] Proposed Regulatory Framework for Modifications to Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device (SaMD). FDA, 2019.
  • [MLV4] Regulatory Guidelines for Software Medical Devices – A Life Cycle Approach. Health Sciences Authority, Government of Singapore, 2020 (Section 8, Artificial Intelligence Medical Devices (AI-MD)).
Method Dimensions
  • In-the-lab environment
  • Experimental - Testing
  • Model
  • Unit testing
  • Thinking
  • Non-Functional - Safety, Functional
  • V&V process criteria, SCP criteria