# The Quantitative Science of Evaluating Imaging Evidence

## Article Information

- Received December 20, 2016.
- Accepted December 21, 2016.
- Published online March 6, 2017.

## Author Information

- Tessa S.S. Genders, MD, PhD^{a}
- Bart S. Ferket, MD, PhD^{b}
- M.G. Myriam Hunink, MD, PhD^{c,d,∗} (m.hunink{at}erasmusmc.nl)

^{a}Duke Clinical Research Institute, Duke University School of Medicine, Durham, North Carolina
^{b}Institute for Healthcare Delivery Science, Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, New York, New York
^{c}Department of Epidemiology and Radiology, Erasmus University Medical Center, Rotterdam, the Netherlands
^{d}Department of Health Policy and Management, Harvard T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts

∗ **Address for correspondence:**

Dr. M.G. Myriam Hunink, Department of Epidemiology and Radiology, Erasmus University Medical Center, P.O. Box 2040, 3000 CA Rotterdam, the Netherlands.

## Central Illustration

## Abstract

Cardiovascular diagnostic imaging tests are increasingly used in everyday clinical practice, but are often imperfect, just like any other diagnostic test. The performance of a cardiovascular diagnostic imaging test is usually expressed in terms of sensitivity and specificity compared with the reference standard (gold standard) for diagnosing the disease. However, evidence-based application of a diagnostic test also requires knowledge about the pre-test probability of disease, the benefit of making a correct diagnosis, the harm caused by false-positive imaging test results, and potential adverse effects of performing the test itself. To assist in clinical decision making regarding appropriate use of cardiovascular diagnostic imaging tests, we reviewed quantitative concepts related to diagnostic performance (e.g., sensitivity, specificity, predictive values, likelihood ratios), as well as possible biases and solutions in diagnostic performance studies, Bayesian principles, and the threshold approach to decision making.

The use of cardiovascular diagnostic imaging tests has increased dramatically in recent decades (1). Although imaging tests have undoubtedly changed the landscape of cardiovascular care, concerns about inappropriate use have prompted clinicians to rethink when imaging is clinically useful, leading, for example, to the development of appropriate use criteria (2,3). Assessment of the appropriateness of using a cardiovascular diagnostic imaging test is challenging, however. It depends on many aspects such as the pre-test probability of disease, diagnostic test performance, the benefit of a correct diagnosis, the harm caused by false-positive test results, and potential adverse effects of performing the test itself (Central Illustration). Thus, critical appraisal and integration of the best available evidence from diagnostic performance studies, (randomized) controlled trials, and cost-effectiveness analyses are crucial to determining the usefulness of a cardiovascular imaging test in a particular patient or population. We provide an overview of the quantitative methods relevant for assessing the clinical value of cardiovascular imaging tests.

## Measures of Diagnostic Performance

### Sensitivity and specificity

Imaging test performance is commonly quantified by sensitivity and specificity. Sensitivity is the proportion of patients with a positive test result among those with the target disease. Specificity is the proportion of patients with a negative test result among those without the target disease (4). For example, Table 1 shows coronary computed tomographic angiography (CTA) results in patients with suspected coronary artery disease (CAD) and disease prevalence of 55% (5). Sensitivity equals 534 of 550 (97%), and specificity equals 392 of 450 (87%).
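As a minimal sketch, the two proportions can be computed directly from the 2 × 2 counts quoted above (534 of 550 diseased patients test positive; 392 of 450 non-diseased patients test negative):

```python
# Sensitivity and specificity from a 2x2 table; counts follow the
# coronary CTA example above (Table 1).
def sensitivity(tp, fn):
    """Proportion of diseased patients with a positive test result."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Proportion of non-diseased patients with a negative test result."""
    return tn / (tn + fp)

tp, fn = 534, 550 - 534   # diseased: 534 positive out of 550
tn, fp = 392, 450 - 392   # non-diseased: 392 negative out of 450

print(f"sensitivity = {sensitivity(tp, fn):.2f}")  # → 0.97
print(f"specificity = {specificity(tn, fp):.2f}")  # → 0.87
```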

This review focuses on diagnostic performance measures at the patient level. However, imaging tests for CAD, for example, can also be interpreted at the vessel or coronary artery segment level, resulting in vessel- or segment-level estimates of sensitivity and specificity. When patient-level estimates are of interest in the setting of multiple observations per patient, the correlation between observations within patients should be taken into account in the statistical analysis to prevent bias (6).

#### The reference standard

Sensitivity and specificity are ideally studied in populations for whom the presence or absence of the target disease can be determined with 100% certainty. In reality, the target disease is usually diagnosed by another test that may be imperfect too. Furthermore, patients who undergo the test under study (index test), especially those with negative results, may often forgo confirmation by the reference standard, which can lead to biased estimates of sensitivity and specificity (7). Therefore, when interpreting diagnostic performance measures, critical appraisal of the choice and application of the reference standard is important. For example, the diagnostic performance of myocardial perfusion imaging for the diagnosis of CAD is often studied using the degree of coronary artery narrowing on invasive coronary angiography as the reference standard. This comparison may be suboptimal due to the poor correlation between functional and anatomic abnormalities, but can be appropriate depending on the clinical question and lack of better alternatives (8).

#### The positivity criterion

Although calculating sensitivity and specificity requires that test results are dichotomous, many imaging tests have more than 2 categories of results. For example, the CT coronary artery calcium (CAC) score is expressed by the Agatston score (9), a single numerical value that ranges from 0 to several thousand. The CAC score can be categorized into groups (e.g., 0, 1 to 9, 10 to 99, 100 to 399, and ≥400) (Table 2) (10). To calculate sensitivity and specificity, we need a positivity criterion, that is, a threshold at or above which we call the test positive and below which we call the test negative (Figure 1). If 10 is the positivity criterion, the sensitivity is 90% and the specificity is 65% (true positives: 235 + 436 + 589 = 1,260; false negatives: 50 + 87 = 137; true negatives: 395 + 1,844 = 2,239; false positives: 667 + 378 + 172 = 1,217). The receiver-operating characteristic (ROC) curve plots sensitivity against 1 − specificity for each possible threshold. The area under the ROC curve (or C-statistic) is a summary measure often used to compare diagnostic tests independent of the positivity criterion. The area under the ROC curve equals 0.5 for a useless test and approaches 1 for a perfect test (Figure 2).
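A sketch of the same calculation across all possible cutoffs, including a trapezoidal area under the ROC curve. The per-category counts follow the worked example above; where the text gives only sums, the assignment of counts to individual score categories is an assumption:

```python
# Sensitivity/specificity at every positivity threshold of an ordinal test,
# plus the trapezoidal area under the ROC curve. Counts per CAC category
# (0, 1-9, 10-99, 100-399, >=400) follow the worked example above; the
# split of summed counts across categories is assumed for illustration.
diseased     = [50, 87, 235, 436, 589]
non_diseased = [1844, 395, 667, 378, 172]

def roc_points(diseased, non_diseased):
    """(1 - specificity, sensitivity) pairs for every possible cutoff."""
    n_d, n_nd = sum(diseased), sum(non_diseased)
    points = [(1.0, 1.0)]                  # everyone called test positive
    for k in range(1, len(diseased) + 1):  # positive = category index >= k
        sens = sum(diseased[k:]) / n_d
        spec = sum(non_diseased[:k]) / n_nd
        points.append((1 - spec, sens))
    return points                          # ends at (0, 0): everyone negative

def auc(points):
    """Trapezoidal area under the ROC curve."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

pts = roc_points(diseased, non_diseased)
x, y = pts[2]  # cutoff at CAC >= 10 reproduces the numbers in the text
print(f"sensitivity = {y:.2f}, specificity = {1 - x:.2f}")  # → 0.90, 0.65
```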

If therapeutic decisions change on the basis of test positivity, changing the positivity criterion of a cardiovascular diagnostic imaging test can affect patient outcomes. The optimal positivity criterion depends on the pre-test probability of disease, the benefit of a correct diagnosis (true positive), and the harms associated with false-positive results (4,11). For example, coronary CTA may be used to rule out an acute coronary syndrome in patients with low-risk acute chest pain (12). To avoid missing a patient with acute coronary syndrome, CAD of any severity on CTA is usually considered a positive test as opposed to the presence of obstructive CAD. A low threshold for positivity (lenient criterion) yields a high sensitivity, and in turn, a high negative predictive value (NPV). This enables clinicians to rule out an acute coronary syndrome with a high degree of certainty (13).

#### Generalizability issues and possible biases

Sensitivity and specificity are proportions conditional on disease presence or absence, respectively. Therefore, theoretically, they are independent of disease prevalence. However, other factors associated with both disease prevalence and test performance act as confounders and can lead to differences in diagnostic performance measures (14,15). A population with a high prevalence of disease may include relatively more patients with severe disease or comorbidities, representing a more severe case mix. The difference in test performance due to differences in case mix is referred to as spectrum bias (15). For example, the sensitivity of CTA for diagnosing CAD is likely to be higher in patients with severe 3-vessel disease than in patients with mild-to-moderate single-vessel disease, simply because severe disease is easier to detect. Although considered a “bias,” this may also be viewed as a matter of generalizability. Differences in disease prevalence can also affect sensitivity and specificity if readers change their (implicit) threshold for test positivity on the basis of the prevalence of disease they expect to see in their population, a phenomenon referred to as reader expectation (14). Because of these issues, it is recommended to consider the case mix and disease prevalence of the population in a diagnostic performance study and to use caution when extrapolating its results to other populations.

The methodological quality of scientific evidence can be assessed using available reporting standards (16–30) (Table 3) and can help identify potential biases. Flaws in diagnostic performance study designs (e.g., including nonconsecutive patients, selective referral, and use of an imperfect reference standard) can lead to bias (14,31). For example, using a case-control design to study diagnostic performance where (severe) cases are selected separately from (healthy) control cases artificially inflates sensitivity and specificity (32).

**Verification bias**

In a setting where patients with negative imaging test results are less likely to undergo verification by the reference standard, sensitivity and specificity may be biased (33,34). This phenomenon is referred to as partial verification bias, work-up bias, or referral bias. In other cases, investigators may choose an alternative (often less perfect) reference standard for those who are not referred for the conventional reference standard. If the alternative reference standard is inferior to the conventional reference standard but assumed to provide the same diagnostic information, differential verification bias may occur (35). For the purpose of this review, we focus on partial verification bias (Figure 3).

A simple method to correct for partial verification bias uses the inverse of the probability of verification depending on whether the test is positive or negative (i.e., inverse probability weighting) (33). If it is known that 10% of patients with a negative test result and 100% of patients with a positive test result are referred for the reference standard test, a reverse calculation can be performed by inflating the number of patients with negative test results by the inverse of the probability of verification (1/10% = 10) (Figure 3). However, this method assumes that the decision for verification is based solely on the index test result and that the positive predictive value (PPV) and NPV remain the same. This method can be extended to include other variables that influence the decision to verify by calculating the probability of verification for individual patients using multivariable logistic regression analysis and performing patient-specific inverse probability weighting.
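The inverse-probability-weighting correction described above can be sketched as follows; the verified 2 × 2 counts are hypothetical, with all test positives and 10% of test negatives referred for the reference standard:

```python
# Correcting sensitivity and specificity for partial verification bias by
# inverse probability weighting, as described above. Verified counts below
# are hypothetical illustration values.
def ipw_corrected(tp, fp, fn, tn, p_verify_pos, p_verify_neg):
    """Reweight verified patients by 1 / P(verification | test result)."""
    w_pos, w_neg = 1 / p_verify_pos, 1 / p_verify_neg
    TP, FP = tp * w_pos, fp * w_pos   # verified test positives, inflated
    FN, TN = fn * w_neg, tn * w_neg   # verified test negatives, inflated
    return TP / (TP + FN), TN / (TN + FP)

# All positives verified, only 10% of negatives verified.
sens, spec = ipw_corrected(tp=90, fp=30, fn=2, tn=18,
                           p_verify_pos=1.0, p_verify_neg=0.10)
# Naive sensitivity among verified patients would be 90/92 = 0.98 (inflated);
# the corrected estimate is lower.
print(f"corrected sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```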

An alternative method to correct for partial verification bias uses multiple imputation (36,37). Imputation models can be used to predict the reference standard result on the basis of clinical information for those who remain unverified. In contrast to single imputation methods where only a single value for the test result is predicted and imputed, multiple imputation methods predict a range of possible test result values (10). Diagnostic test performance can be calculated for each imputed dataset separately (each dataset contains the same nonmissing data but differs with respect to the imputed missing values), and the imputed datasets can be combined to obtain point estimates with confidence intervals that include the uncertainty related to the imputations. This method assumes that missing data occur randomly conditional on the available data. It also assumes that all variables associated with the reference test result are available and that the model for predicting the test result is correctly specified.

## The Clinical Value of a Diagnostic Imaging Test

### Bayesian principles

Bayes’ theorem describes the probability of an event conditional on another related event. In diagnostic imaging specifically, it dictates how much a previous belief (pre-test probability) should change on the basis of imaging information, resulting in a posterior belief (post-test probability) (Figure 4). In mathematical terms, the pre-test odds are multiplied by the positive likelihood ratio (LR) for a positive test result, or by the negative LR for a negative test result, to obtain the post-test odds. Odds and probabilities both express chance and can be converted back and forth (Online Appendix).
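The odds form of the update can be sketched in a few lines; the pre-test probability of 30% and the LR of 7.5 below are assumed values for illustration:

```python
# Bayes' theorem in odds form: post-test odds = pre-test odds x LR.
def prob_to_odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

def odds_to_prob(o):
    """Convert odds back to a probability."""
    return o / (1 + o)

def post_test_probability(pre_test_prob, lr):
    """Update a pre-test probability with a likelihood ratio."""
    return odds_to_prob(prob_to_odds(pre_test_prob) * lr)

# Assumed illustrative values: pre-test probability 30%, positive LR 7.5.
p = post_test_probability(0.30, 7.5)
print(f"post-test probability = {p:.2f}")  # → 0.76
```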

#### Likelihood ratio

LR is an alternative measure to quantify diagnostic performance. It is a measure of how much the odds of disease change on the basis of a test result and equals the probability of a test result in patients with the disease divided by the probability of that same test result in patients without the disease (Online Appendix). If the LR is 1, then the test result is equally likely to occur in either group—that is, the test result does not give us any additional information as to whether disease is present. When the LR is >1, the test result argues in favor of the presence of disease, whereas an LR <1 argues against the presence of disease. To calculate the positive LR, the probability of a positive test is used, whereas for the negative LR, the probability of a negative test is used (Online Appendix).

LR are convenient when applying Bayes’ theorem: multiplying the pre-test odds by the LR yields the post-test odds. LR of several imaging test results, even when obtained in separate populations, can be combined by multiplication, because they do not depend on the pre-test probability. Multiplying LR, however, implicitly assumes that the test results are “conditionally independent,” that is, that given the presence (or absence) of disease, the results of the different tests are uncorrelated (4). For cardiovascular imaging tests, this assumption is often not met. For example, the odds of obstructive CAD in a patient with a positive coronary CTA and a positive stress echocardiography (echo) are likely to be overestimated by pre-test odds × LR_{positive CTA} × LR_{positive echo}. CTA and echo results are not conditionally independent: among patients with CAD, those with more severe disease are more likely to have both a positive echo and a positive CTA, so the 2 results remain correlated even after conditioning on the presence of disease. Multiplying LR in this case will overestimate the diagnostic information gained. However, when combining a patient’s physical examination findings and laboratory markers as independent diagnostic tests to diagnose acute coronary syndrome, LR can be multiplied to assess their joint effect on the likelihood of disease (38,39).

#### PPV and NPV

Clinical decision making for patients cannot be based solely on test characteristics such as sensitivity and specificity because they reflect the probability of a test result conditional on disease presence and absence, respectively. In contrast, predictive values provide the probability of disease given the imaging test results. PPV is defined as the probability of disease among patients with positive test results. NPV is defined as the probability of absence of disease among patients with negative test results.

#### The pre-test probability

The pre-test probability of disease is the probability that the patient has the target disease conditional on the available clinical information before the index test is performed. The pre-test probability can be calculated on the basis of clinical judgment, the mean prevalence of the population from which the patient is drawn, or clinical variables with or without information from previous test results. For example, multivariable prediction models for the presence of CAD can be used to calculate the pre-test probability of CAD on the basis of patient-specific clinical variables with or without the CAC score (10). The pre-test probability affects the PPV and NPV of a diagnostic test and thus determines the usefulness of an imaging test.

For example, coronary CTA is an excellent diagnostic imaging test for the diagnosis of CAD with a sensitivity of 98% and specificity of 89% compared with invasive coronary angiography (5). PPV of a positive CTA is 69% when the pre-test probability is 20%, whereas the PPV is 97% when the pre-test probability is 80% (Figure 5). For easy calculations and interpretations of predictive values, we recommend constructing a 2 × 2 table by fixing the column totals using the pre-test probability, filling in the true-positive, false-negative, false-positive, and true-negative numbers using sensitivity and specificity, summing the rows to obtain total positive test results and total negative test results, and subsequently deriving predictive values (Table 1).
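The 2 × 2 construction just described can be sketched directly, reproducing the PPV of 69% at a pre-test probability of 20% and 97% at 80%:

```python
# Predictive values from sensitivity, specificity, and pre-test probability,
# via the fixed-column-total 2x2 construction described above.
def predictive_values(sens, spec, pre_test, n=10_000):
    """Return (PPV, NPV) for a notional cohort of n patients."""
    diseased = n * pre_test
    healthy = n - diseased
    tp, fn = diseased * sens, diseased * (1 - sens)   # diseased column
    tn, fp = healthy * spec, healthy * (1 - spec)     # non-diseased column
    ppv = tp / (tp + fp)   # row of positive test results
    npv = tn / (tn + fn)   # row of negative test results
    return ppv, npv

# Coronary CTA: sensitivity 98%, specificity 89%, as quoted above.
for pre_test in (0.20, 0.80):
    ppv, npv = predictive_values(0.98, 0.89, pre_test)
    print(f"pre-test {pre_test:.0%}: PPV = {ppv:.0%}, NPV = {npv:.0%}")
# → pre-test 20%: PPV = 69%, NPV = 99%
# → pre-test 80%: PPV = 97%, NPV = 92%
```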

### Deciding when to image

#### Clinical trials

Clinical trials, specifically randomized controlled trials (RCT), have become a major focus in cardiovascular imaging research (40–42). They are considered to provide the highest level of evidence regarding the effectiveness of diagnostic strategies in terms of clinical outcomes. For example, the PROMISE (Prospective Multicenter Imaging Study for Evaluation of Chest Pain) trial was a pragmatic multicenter RCT that compared outcomes of initial testing with CTA versus initial functional testing for symptomatic patients with suspected CAD (40). The PROMISE trial did not show a difference in clinical outcomes (composite endpoint including death, myocardial infarction, hospitalization for unstable angina, or major procedural complication) over a median follow-up of 2 years. Similarly, the CRESCENT (Computed Tomography vs. Exercise Testing in Suspected Coronary Artery Disease) trial compared initial functional testing with a tiered cardiac CT approach in which the diagnostic work-up differed depending on the CAC score (42). The CRESCENT trial showed an increase in event-free survival after 1.2 years and decreased diagnostic expenses for patients in the CT arm.

#### The threshold approach to decision making

The goal of a diagnostic imaging test is to gain more diagnostic certainty and to decide whether a patient should receive treatment or not. For any given cardiovascular disease, clinicians may use a treatment threshold (i.e., the threshold above which the probability of disease is high enough to warrant treatment and below which treatment is deemed unnecessary). For example, consider a patient with anginal symptoms for whom medical risk factor management (e.g., aspirin, statins, beta-blockers) is considered. Medical management can reduce the patient’s risk of cardiovascular events, especially if the patient has underlying CAD. On the other hand, medical therapy is associated with adverse effects (e.g., gastrointestinal bleeding, myopathy, syncope). The trade-off depends on the balance of the harms and benefits and the probability that CAD is present. Treatment is justified when the probability of disease is high enough for the benefits to outweigh the risks, and conversely treatment should be withheld when the risks outweigh the benefits (e.g., in patients with a very low probability of CAD). The net harm of treatment is defined as the difference in outcome between treatment and no treatment for patients without CAD (probability of CAD = 0), whereas the net benefit of treatment is defined as the difference in outcome between treatment and no treatment for patients with CAD (probability of CAD = 1). The ratio between the harm and benefit is proportional to the odds at the no treat–treat threshold (Figure 6) (43). The outcome that is used to determine the benefit and harm of treatment can be defined in various ways, but ideally includes long-term effects (e.g., event-free survival, quality-adjusted life-expectancy, as well as costs) (4,43).
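Because the harm-to-benefit ratio equals the odds at the no treat–treat threshold, the threshold probability itself follows in one line. A minimal sketch, with assumed illustrative values for the net benefit and net harm of treatment:

```python
# Treat / no-treat threshold: at the threshold probability p*, the expected
# benefit of treating the diseased equals the expected harm of treating the
# non-diseased, so p* / (1 - p*) = harm / benefit, hence
# p* = harm / (harm + benefit).
def treatment_threshold(net_benefit, net_harm):
    """Probability of disease above which treatment is warranted."""
    return net_harm / (net_harm + net_benefit)

# Assumed illustrative values: treating a diseased patient gains 0.20
# quality-adjusted life-years; treating a non-diseased patient loses 0.02.
p_star = treatment_threshold(net_benefit=0.20, net_harm=0.02)
print(f"treatment threshold = {p_star:.3f}")  # → 0.091
```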

In general, diagnostic imaging tests are most informative for patients with a pre-test probability near the treatment threshold, because the result of the diagnostic imaging test can “move” the patient’s probability across the threshold and alter clinical management. The no treat–test threshold and the test-treat threshold define the range of probabilities for which diagnostic testing is useful, and these thresholds depend on the benefit and harm of treatment, diagnostic test performance, and the risk of the test (Figure 7) (4,43).

#### Decision modeling

Decision analysis can be used to determine testing thresholds by comparing outcomes for different diagnostic strategies at different pre-test probabilities, such as for patients with stable chest pain (44). Modeling approaches can incorporate multiple strategies and clinical outcomes can be extrapolated over extended time periods. They allow the integration of the results of clinical trials and observational studies related to diagnostic test performance, treatment effects (e.g., reduction in cardiovascular event rates and improvements in quality of life), and costs so that all the relevant benefits and harms can be weighed at the same time (Central Illustration). Modeling approaches also have limitations, and simulation models are only as good as the data used to populate them and the necessary assumptions. However, in the absence of decisive answers from imaging RCT, decision-analytic models can provide guidance for clinical decision making.

Decision models can also be used to determine the optimal positivity criterion in the context of a test with multiple test results and when using prediction models (4). For example, if we want to determine what CAC score should lead to further work-up and treatment, we can calculate the optimal cutoff point taking into account the pre-test probability of disease, the expected net benefit (and net cost, if relevant) of correctly diagnosing the disease (true positives), and the expected net harms (and net cost) associated with false-positive results (4). The optimal operating point will shift to a less stringent positivity criterion (i.e., higher sensitivity and lower specificity, lower LR, and flatter ROC slope) if the probability of disease is higher, the net benefit (and cost savings) of diagnosing the disease is larger, or the net harm (and induced costs) associated with false-positive results is smaller.
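The shift of the optimal operating point described above can be made concrete: the LR (local ROC slope) at the optimal cutoff equals the product of the pre-test odds against disease and the harm-to-benefit ratio. A sketch under assumed illustrative values:

```python
# Optimal ROC operating point: pick the cutoff whose likelihood ratio
# (the local slope of the ROC curve) equals
#   ((1 - p) / p) * (net harm of a false positive / net benefit of a true positive).
def target_lr(pre_test, benefit_tp, harm_fp):
    """Likelihood ratio at the optimal positivity criterion."""
    return (1 - pre_test) / pre_test * (harm_fp / benefit_tp)

# Assumed illustrative values for benefit and harm; as the pre-test
# probability rises, the optimal cutoff shifts to a lower LR, i.e., a less
# stringent positivity criterion, consistent with the text above.
for p in (0.10, 0.50):
    print(f"pre-test {p:.0%}: optimal cutoff at LR = {target_lr(p, 0.20, 0.02):.2f}")
# → pre-test 10%: optimal cutoff at LR = 0.90
# → pre-test 50%: optimal cutoff at LR = 0.10
```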

## Discussion

Sensitivity and specificity are commonly used measures and are important to quantify diagnostic performance and to compare diagnostic performance across imaging modalities. Clinicians should be aware, however, that sensitivity and specificity are calculated conditional on the presence or absence of disease and therefore cannot be applied directly to patients. An estimate of the pre-test probability of disease is needed to calculate predictive values, reflecting a patient’s post-test probability of disease, which is more relevant clinically. The same calculations are more conveniently performed using LR. However, assessment of the overall clinical value or appropriateness of a cardiovascular imaging test should also include consideration of the long-term benefits and harms of testing, outcomes that go beyond disease likelihood.

Although RCT are considered to provide the highest level of evidence, because they are able to provide unbiased estimates of harms and benefits, RCT have important limitations that deserve mentioning. First, imaging RCT often compare only 2 or 3 diagnostic strategies over a limited follow-up time. In reality, many more diagnostic strategies are possible, and relevant differences in outcomes following randomization may accrue over many years beyond trial completion. The optimal decision regarding the use of a cardiovascular imaging test ideally integrates the best available evidence for multiple endpoints (e.g., cardiovascular events, radiation-induced cancer), which should be valued differently (e.g., according to their impact on quality of life, and costs to society or the health system), whereas imaging RCT often focus on 1 (composite) outcome at a time. These important limitations can be partly overcome by using decision analysis and modeling techniques.

Our review demonstrates that the clinical value of a cardiovascular imaging test depends on many variables that go beyond its diagnostic performance (i.e., sensitivity and specificity), such as the pre-test probability of disease, the benefit of making a correct diagnosis, the harm caused by false-positive imaging test results, and potential adverse effects of performing the test itself (Central Illustration). In clinical practice, individual patient factors—such as the (in)ability to exercise, body habitus, renal insufficiency, or claustrophobia—may also determine which test (not) to use. As a consequence, the ultimate goal would be to synthesize all these factors into the decision making for each unique patient. This requires an integrated and personalized approach by placing cardiovascular imaging in a broader context of individualized long-term outcomes.

## Appendix


For supplemental material, please see the online version of this paper.

## Footnotes

This work was supported by an American Heart Association Grant #16MCPRP31030016 (Dr. Ferket). Dr. Hunink has received royalties for textbook from Cambridge University Press; grants and nonfinancial support from European Society of Radiology; and nonfinancial support from European Institute for Biomedical Imaging Research, outside the submitted work. All other authors have reported that they have no relationships relevant to the contents of this paper to disclose. Pamela Douglas, MD, served as the Guest Editor for this paper.

**Abbreviations and Acronyms**

- CAC = coronary artery calcium
- CAD = coronary artery disease
- CTA = computed tomographic angiography
- LR = likelihood ratio
- NPV = negative predictive value
- PPV = positive predictive value
- RCT = randomized controlled trial(s)
- ROC = receiver-operating characteristic


- American College of Cardiology Foundation

## References

- Lucas F.L., DeLorenzo M.A., Siewers A.E., Wennberg D.E.
- Carr J.J., Hendel R.C., White R.D., et al.
- Hunink M.G., Weinstein M.C., Wittenberg E., et al.
- Agatston A.S., Janowitz W.R., Hildner F.J., Zusmer N.R., Viamonte M. Jr., Detrano R.
- Genders T.S., Steyerberg E.W., Hunink M.G., et al.
- Phelps C.E., Mushlin A.I.
- Hoffmann U., Bamberg F., Chae C.U., et al.
- Bossuyt P.M., Reitsma J.B., Bruns D.E., et al., for the STARD Group
- Burda B.U., Holmer H.K., Norris S.L.
- Schulz K.F., Altman D.G., Moher D., CONSORT Group
- Moher D., Hopewell S., Schulz K.F., et al.
- Husereau D., Drummond M., Petrou S., et al.
- Brouwers M.C., Kho M.E., Browman G.P., et al.
- Rutjes A.W., Reitsma J.B., Vandenbroucke J.P., Glas A.S., Bossuyt P.M.
- de Groot J.A., Bossuyt P.M., Reitsma J.B., et al.
- Neutra R.
- Lubbers M., Dedic A., Coenen A., et al.