Here, we discuss how to perform a validation and how to calculate the Standard Error of Prediction, the best metric of calibration quality.
Now that we have studied how to obtain a calibration, determined its quality, and eliminated outliers, the next step is to validate the calibration. A validation is an independent test of the quality and robustness of a calibration and is the best measure we know of to determine calibration quality. This column discusses how to perform a validation and how to calculate the Standard Error of Prediction, the best metric of calibration quality.
The art and science of cannabis analysis involves the measurement of many different quantities, including cannabinoids, terpenes, mold, moisture, pesticides, and heavy metals. In all these cases an instrument is used to make the measurement, and the instrument must be calibrated with standard samples containing a known amount of the analyte of interest. Calibration, then, is the process in which the instrument response is correlated with the known quantity of analyte in the standards (1-7).
In a previous column (2) we introduced the concept of variance, which is a measure of the spread or scatter in a set of data. In our case, this could be the known tetrahydrocannabinol (THC) concentrations in a set of standards. We then calculated a bunch of “sums of squares” including the sum of squares total (SST) which measures the total amount of variance, the sum of squares modeled by a regression model or SSR, and the amount of variance not modeled or the error in a model denoted by SSE (2).
A commonly used measure of calibration model quality is the Standard Deviation as defined in Equation 1, where σ is the standard deviation, SSE is the sum of squares due to error (variance not modeled), and n is the number of data points.
Note that the standard deviation contains the SSE, the amount of error in a model, divided by n, the number of data points in a data set. In our case n can be, for example, the number of THC standards used in calibrating a chromatograph or spectrometer for potency analyses. The Standard Deviation then can be thought of as the average error per standard sample.
The Correlation Coefficient, R, is a measure of model quality and is given by Equation 2 where R is the correlation coefficient, SSR is the sum of squares due to regression (variance modeled), and SST is the sum of squares total (total variance).
Since R depends upon the ratio of variance explained by a calibration model divided by the total variance, we can think of it as a measure of the fraction of variance modeled. Thus, for a perfect model R = 1, and for the worst possible model R = 0, so R is on a zero to one scale. For our purposes, where the calibration model is usually a plotted line, R is a measure of model linearity.
Lastly, the F for regression, or robustness of a model, is given by Equation 3 where F is the F for regression, SSR is the sum of squares due to regression (variance modeled), SSE is the sum of squares due to error (variance not modeled), n is the number of data points (number of samples), and m is the number of independent variables, in our case 1.
Robustness is a measure of how sensitive a model is to small changes in input data. Note that F depends upon SSR/SSE, which can be thought of as a signal-to-noise ratio for a calibration. In a robust model, a small increase in error would result in a negligible decrease in calibration quality, whereas for a non-robust model a small increase in error would lead to a significant deterioration in model quality. The bigger F is, the more robust and hence the better the model.
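The three metrics reviewed above can be sketched in a few lines of Python. This is a minimal sketch, not the column's own calculation: the data are hypothetical, the helper names `fit_line` and `regression_metrics` are made up for illustration, the standard deviation divides SSE by n as described in the text (some treatments divide by n - 1 instead), R is taken as the square root of SSR/SST, and F uses the common textbook form (SSR/m)/(SSE/(n - m - 1)), which is an assumption here.

```python
import math

def fit_line(x, y):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

def regression_metrics(x, y, m=1):
    """Return the sums of squares plus sigma, R, and F for a straight-line fit."""
    n = len(y)
    slope, intercept = fit_line(x, y)
    y_mean = sum(y) / n
    y_fit = [slope * xi + intercept for xi in x]
    sst = sum((yi - y_mean) ** 2 for yi in y)               # total variance
    ssr = sum((fi - y_mean) ** 2 for fi in y_fit)           # variance modeled
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_fit))   # variance not modeled
    return {
        "SST": sst,
        "SSR": ssr,
        "SSE": sse,
        "sigma": math.sqrt(sse / n),            # assumption: SSE divided by n
        "R": math.sqrt(ssr / sst),              # zero to one scale
        "F": (ssr / m) / (sse / (n - m - 1)),   # assumption: textbook F for regression
    }

# Hypothetical standards: concentration on x, instrument response on y.
conc = [10.0, 20.0, 30.0, 40.0, 50.0]
area = [2.1, 3.9, 6.2, 7.8, 10.1]
metrics = regression_metrics(conc, area)
for name, value in metrics.items():
    print(f"{name}: {value:.4g}")
```

Note that F is undefined for a perfect fit (SSE = 0), so the sketch assumes the data contain at least some scatter.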
In a previous column, we introduced a real-world calibration where the peak areas in mid-infrared spectra were used to quantitate the amount of isopropyl alcohol (IPA) dissolved in water (3). The peak area and known volume % IPA data are seen in Table I, and the resultant calibration line is seen in Figure 1.
From previous columns (1-3), the equation of our IPA in water calibration line is given by Equation 4, which is also seen at the top of Figure 1.
Since peak area is plotted on the y-axis, we can substitute A for y in Equation 4, and since the concentration of IPA is plotted on the x-axis, we can substitute C for x in Equation 4 to obtain Equation 5, where A is the peak area (absorbance) and C is the concentration.
Our goal in performing a calibration is to predict the concentration of analyte in a sample. Thus, we can rearrange Equation 5 to solve for C as shown in Equation 6:
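As a minimal sketch of this rearrangement, assume the calibration line has the usual straight-line form A = slope × C + intercept, so that solving for C gives C = (A - intercept) / slope. The slope and intercept values below are hypothetical placeholders, not the coefficients of the IPA calibration, and `predict_concentration` is a made-up name.

```python
def predict_concentration(area, slope, intercept):
    """Invert A = slope * C + intercept to get C from a measured peak area."""
    if slope == 0:
        raise ValueError("slope must be nonzero to invert the calibration line")
    return (area - intercept) / slope

# Hypothetical calibration coefficients, for illustration only.
slope, intercept = 0.5, 0.2

# Round trip: compute the area a 40% sample would give, then predict it back.
area = slope * 40.0 + intercept
predicted = predict_concentration(area, slope, intercept)
print(f"Predicted concentration: {predicted:.2f}")
```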
A test of calibration quality would be to take the peak areas measured for the standards as listed in Table I, plug them into Equation 6, and generate a series of predicted concentrations. We can then calculate for the calibration what is called the standard error of calibration, or SEC for short, using Equation 7, where i is the index over the number of calibration samples, Pi is the predicted concentration for a calibration sample, Ki is the known concentration for a calibration sample, and n is the number of calibration samples.
Note that Equation 7 has similarities to Equation 1, and in fact the SEC is simply the standard deviation of the predicted minus known concentration values for the calibration sample set.
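The SEC calculation can be sketched as follows. This is an illustration under stated assumptions rather than the column's exact formula: the calibration data are hypothetical, the function names are made up, and the sum of squared (predicted - known) values is divided by n by default. The `denominator_offset` parameter lets you divide by n - 1 instead if that is the convention you follow.

```python
import math

def fit_line(x, y):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

def standard_error_of_calibration(predicted, known, denominator_offset=0):
    """Root of the summed squared (predicted - known) values over n - offset."""
    if len(predicted) != len(known):
        raise ValueError("predicted and known must have the same length")
    dof = len(known) - denominator_offset
    if dof <= 0:
        raise ValueError("not enough samples for this denominator")
    total = sum((p - k) ** 2 for p, k in zip(predicted, known))
    return math.sqrt(total / dof)

# Hypothetical calibration set: known concentrations and measured peak areas.
known_conc = [10.0, 20.0, 30.0, 40.0, 50.0]
peak_area = [2.1, 3.9, 6.2, 7.8, 10.1]

# Fit area versus concentration, then invert the line to predict concentration
# for the same standards that were used to build the calibration.
slope, intercept = fit_line(known_conc, peak_area)
predicted_conc = [(a - intercept) / slope for a in peak_area]

sec = standard_error_of_calibration(predicted_conc, known_conc)
print(f"SEC = {sec:.3f}")
```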
To be clear, the SEC tells us how well the calibration does on samples it has seen before, that is, samples that were included in the calibration. In real life, calibration lines are applied to unknown samples, not standards, so the SEC is not the most realistic measure of calibration quality. This is where validation comes in.
Validation means that a calibration is applied to the instrument response of standard samples of known analyte concentration that were not used in the calibration. We will call these samples validation samples. Validation is the best measure of calibration quality because it mimics what a calibration does once implemented: being applied to samples it has not seen before, such as unknowns. Thus, a validation shows us how well we can expect a calibration to function in the real world.
Thus, to validate our IPA calibration we would measure the peak area for a series of validation samples, predict the IPA concentration, and then compare the known and predicted concentrations. The question of how many validation samples to use naturally crops up. In general, you want at least 10% as many validation samples as calibration samples, with a minimum of three, as that is the smallest number of samples for which we can calculate validation metrics.
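The rule of thumb above can be written as a one-line helper. This is a small sketch: the function name `n_validation_samples` is made up, and rounding a fractional count up to the next whole sample is an assumption, since the text does not say how to round.

```python
import math

def n_validation_samples(n_calibration, percent=10, minimum=3):
    """At least `percent` of the calibration set, but never fewer than `minimum`."""
    return max(minimum, math.ceil(n_calibration * percent / 100))

for n_cal in (5, 31, 50, 100):
    print(n_cal, "calibration samples ->", n_validation_samples(n_cal), "validation samples")
```

For small calibration sets the minimum of three dominates; the 10% figure only takes over once there are more than 30 calibration samples.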
Information and results for the validation of our IPA in water calibration are seen in Table II. Note in Table I above that there are five calibration samples. Using the 10% guideline mentioned above, we should then have 0.5 validation samples, which is of course silly. Thus, the minimum number of validation samples, three, was chosen. Table II shows the known IPA concentration for the validation set in the left column, the resultant measured peak area in the second column, the predicted concentrations using Equation 6 in the third column, and the predicted minus known %IPA values for the validation samples in the rightmost column.
Note there is a metric called the bias calculated in the table. This is a calibration metric I probably should have introduced earlier, but better late than never. In everyday life we may accuse someone, or a news story, of being biased if the truth is skewed intentionally one way or another. In calibration models, bias is a measure of the amount of systematic error in a calibration model because it measures how far and in what direction each measurement is skewed on average. We have discussed systematic error previously (4). Bias is calculated by summing the (predicted – known) values for a series of validation samples as given by Equation 8, where i is the index over the number of validation samples, Pi is the predicted concentration for a validation sample, and Ki is the known concentration for a validation sample.
In our validation, i goes from 1 to 3. In Table II, the bias calculation returned a value of -0.3% IPA. This means that on average our calibration predicts IPA in water values that are low by 0.3%. This bias can be corrected in predicted sample values by simply adding 0.3% to each measurement.
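A bias calculation and correction might look like the sketch below. Two assumptions are worth flagging: the summed (predicted - known) values are divided by the number of validation samples so that the result reads as an average error, which is how the bias is interpreted above, and the validation values are hypothetical numbers chosen for illustration, not the data in Table II. The function names are also made up.

```python
def bias(predicted, known):
    """Average of the (predicted - known) values; negative means predictions run low."""
    if len(predicted) != len(known):
        raise ValueError("predicted and known must have the same length")
    return sum(p - k for p, k in zip(predicted, known)) / len(known)

def correct_bias(predicted, bias_value):
    """Remove a constant bias from each predicted value."""
    return [p - bias_value for p in predicted]

# Hypothetical validation set (not the Table II data).
known = [10.0, 30.0, 50.0]
predicted = [9.5, 30.2, 49.4]

b = bias(predicted, known)
corrected = correct_bias(predicted, b)
print(f"bias = {b:.2f}")
print(f"bias after correction = {bias(corrected, known):.2f}")
```

Subtracting a negative bias is the same as adding its magnitude, so a bias of -0.3 is corrected by adding 0.3 to each prediction.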
Also noted in Table II is an entry labeled “SEP,” which stands for Standard Error of Prediction and is given by Equation 9, where i is the index over the number of validation samples, Pi is the predicted concentration for a validation sample, Ki is the known concentration for a validation sample, and n is the number of validation samples.
Again, in our example i varies from 1 to 3 and n = 3. Note that Equation 9 is similar in form to Equation 1 for the Standard Deviation and Equation 7 for the SEC. The SEP is in fact the standard deviation of the predicted minus known data for the validation sample set. The formulas for SEC and SEP are identical except that the former uses calibration samples and the latter uses validation samples.
To be clear, the SEP is calculated for validation samples only. Thus, in my opinion it is the best measure of calibration quality because it tells us how well a calibration does on samples it has not seen before, exactly what a calibration is expected to do in real life. With proper rounding then we would say that the SEP for our example calibration is 0.96% IPA, which is pretty darn good in my opinion.
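To make the validation workflow concrete, here is a minimal sketch that applies an existing calibration line to validation samples and computes the SEP. The slope, intercept, and validation data are hypothetical and will not reproduce the 0.96% IPA figure. The function name is made up, and dividing by n is an assumption; set `denominator_offset=1` if you prefer n - 1.

```python
import math

def standard_error_of_prediction(predicted, known, denominator_offset=0):
    """Root of the summed squared (predicted - known) values over n - offset."""
    if len(predicted) != len(known):
        raise ValueError("predicted and known must have the same length")
    dof = len(known) - denominator_offset
    if dof <= 0:
        raise ValueError("not enough samples for this denominator")
    total = sum((p - k) ** 2 for p, k in zip(predicted, known))
    return math.sqrt(total / dof)

# Hypothetical coefficients from an earlier calibration: A = slope * C + intercept.
slope, intercept = 0.2, 0.05

# Hypothetical validation samples that were NOT used to build the calibration.
val_known = [15.0, 25.0, 45.0]
val_area = [3.10, 5.00, 9.15]

# Predict concentration from each measured peak area, then compare with the knowns.
val_predicted = [(a - intercept) / slope for a in val_area]
sep = standard_error_of_prediction(val_predicted, val_known)
print(f"SEP = {sep:.3f}")
```

The only difference from an SEC calculation is which samples go in: here the predicted and known values come from the validation set rather than the calibration set.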
We have said much in this Calibration Science column series about how to plot calibration lines and to calculate their metrics (1-5). However, did you know that validation lines can be plotted as well? A validation line is simply a plot of predicted versus known concentrations for a set of validation samples. The validation line for the data in Table II is seen in Figure 2.
Note in the upper left-hand corner we have calculated the equation of the line and the square of its correlation coefficient, R². Typically, then, we will have a calibration line with its own metrics, including an R² and a standard error of calibration, and a validation line with its own R² and standard error of prediction. The metrics for the validation line are the best measure of calibration quality because, again, the validation mimics what calibrations do in real life: predicting the concentration of an analyte in an unknown sample.
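A validation line and its R² can be computed with a short least-squares sketch such as the one below. The data are hypothetical, not those plotted in Figure 2, and `validation_line` is a made-up name. For a calibration that predicts well, the slope should come out close to 1 and the intercept close to 0, since predicted and known concentrations would nearly coincide.

```python
def validation_line(known, predicted):
    """Fit predicted = slope * known + intercept and return (slope, intercept, R squared)."""
    n = len(known)
    k_mean = sum(known) / n
    p_mean = sum(predicted) / n
    skk = sum((k - k_mean) ** 2 for k in known)
    spp = sum((p - p_mean) ** 2 for p in predicted)
    skp = sum((k - k_mean) * (p - p_mean) for k, p in zip(known, predicted))
    slope = skp / skk
    intercept = p_mean - slope * k_mean
    r_squared = skp ** 2 / (skk * spp)   # square of the correlation coefficient
    return slope, intercept, r_squared

# Hypothetical validation results: known versus predicted concentration.
known = [15.0, 25.0, 45.0]
predicted = [15.25, 24.75, 45.5]

slope, intercept, r2 = validation_line(known, predicted)
print(f"predicted = {slope:.3f} * known + {intercept:.3f}, R^2 = {r2:.4f}")
```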
We don’t have space to go into it here, but the correlation coefficient depends upon data structure, that is, how the concentrations of the standard samples are spread out across their range. For our example, the calibration seen in Figure 1, ideally all five standard concentrations would be evenly spaced. This is not always possible, and changing how the standard concentrations are scattered across their range will change R, all other things being equal. Thus, R is not as good a predictor of model quality as SEP.
In the final analysis, then, we want a model that minimizes SEP and maximizes robustness while hopefully giving us a good R value.
We reviewed what a calibration is and the calibration metrics of standard deviation, correlation coefficient, and robustness. We introduced the calibration metric of the Standard Error of Calibration. The concept of validation was discussed and we applied it to our example IPA in water system. We found that the bias in a validation is a measure of systematic error and can be corrected for, and that the Standard Error of Prediction, SEP, is the best predictor of calibration quality as it tells us how well a given calibration model performs on samples it has not seen before. In the end, the best calibration will minimize SEP while maximizing robustness.
References
How to Cite this Article
Smith, B., Calibration Science, Part VI: Validation, Cannabis Science and Technology, 2024, 7(5), 6-9.
Brian C. Smith, PhD, is Founder, CEO, and Chief Technical Officer of Big Sur Scientific. He is the inventor of the BSS series of patented mid-infrared based cannabis analyzers. Dr. Smith has done pioneering research and published numerous peer-reviewed papers on the application of mid-infrared spectroscopy to cannabis analysis, and sits on the editorial board of Cannabis Science and Technology. He has worked as a laboratory director for a cannabis extractor, as an analytical chemist for Waters Associates and PerkinElmer, and as an analytical instrument salesperson. He has more than 30 years of experience in chemical analysis and has written three books on the subject. Dr. Smith earned his PhD in physical chemistry from Dartmouth College.
Direct correspondence to: brian@bigsurscientific.com.