Author information
- Received September 7, 2011
- Revision received July 5, 2012
- Accepted July 9, 2012
- Published online October 1, 2012.
- Sukhjinder S. Nijjer, BSc, MBChB⁎,⁎,
- Punam A. Pabari, MBChB, PhD†,
- Berthold Stegemann, PhD‡,
- Vittorio Palmieri, MD, PhD§,
- Francisco Leyva, MD∥,
- Cecilia Linde, MD, PhD¶,
- Nick Freemantle, PhD#,
- Justin E. Davies, BSc, MBBS, PhD⁎,
- Alun D. Hughes, MBBS, PhD⁎⁎ and
- Darrel P. Francis, MA, MD⁎⁎
- ⁎Reprint requests and correspondence:
Dr. Sukhjinder S. Nijjer, Department of Cardiology, Hammersmith Hospital, Du Cane Road, London W12 0HS, United Kingdom
Objectives We sought a method for any reader to quantify the limit, imposed by variability, to sustainably observable R2 between any baseline predictor and response marker. We then apply this to echocardiographic measurements of mechanical dyssynchrony and response.
Background Can mechanical dyssynchrony markers strongly predict ventricular remodeling by biventricular pacing (cardiac resynchronization therapy)?
Methods First, we established the mathematical depression of observable R2 arising from: 1) spontaneous variability of response markers; and 2) test–retest variability of dyssynchrony measurements. Second, we contrasted published R2 values between externally monitored randomized controlled trials and highly skilled single-center studies (HSSCSs).
Results Inherent variability of response markers causes a contraction factor in R2 of 0.48 (change in left ventricular ejection fraction [ΔLVEF]), 0.50 (change in end-systolic volume [ΔESV]), and 0.40 (change in end-diastolic volume [ΔEDV]). Simultaneously, inherent variability of mechanical dyssynchrony markers causes a contraction factor of between 0.16 and 0.92 (average, 0.6). Therefore the combined contraction factor, that is, the limit on sustainably observable R2 between mechanical dyssynchrony markers and response, is ∼0.29 (ΔLVEF), ∼0.30 (ΔESV), and ∼0.24 (ΔEDV). Many R2 values published in HSSCSs exceeded these mathematical limits; none in externally monitored trials did so. Overall, HSSCSs overestimate R2 by 5- to 20-fold (p = 0.002). Absence of bias-resistance features in study design (formal enrollment and blinded measurements) was associated with more overstatement of R2.
Conclusions Reports of R2 > 0.2 in response prediction arose exclusively from studies without formally documented enrollment and blinding. The HSSCS approach overestimates R2 values, frequently breaching the mathematical ceiling on sustainably observable R2, which is far below 1.0, and can easily be calculated by readers using formulas presented here. Community awareness of this low ceiling may help resist future claims. Reliable individualized response prediction, using methods originally designed for group-mean effects, may never be possible because it has 2 currently unavailable and perhaps impossible prerequisites: 1) excellent blinded test–retest reproducibility of dyssynchrony; and 2) response markers reproducible over time within nonintervened individuals. Dispassionate evaluation, and improvement, of test–retest reproducibility is required before any further claims of strong prediction. Prediction studies should be designed to resist bias.
Biventricular pacing is thought to deliver benefit in heart failure through resynchronization of dyssynchronous cardiac mechanical function, hence the term cardiac resynchronization therapy (1–6). Some studies (7–9) demonstrate strong relationships (high coefficient of determination, R2 values) between baseline mechanical dyssynchrony and echocardiographic outcome measures, whereas others (10–12) show much weaker relationships. Most guidelines for selecting patients for biventricular pacing emphasize electrical dyssynchrony manifested as wide QRS duration rather than mechanical dyssynchrony (13), although pressure is growing from increasing numbers of positive studies reporting an association between baseline dyssynchrony and ventricular response. One country's guidelines already include mechanical dyssynchrony in selection (14).
Tantalizing glimpses of reliable prediction of response continue to drive the search for mechanical dyssynchrony markers or multivariate combination algorithms to provide better prediction. But is this approach wise? Why do studies disagree? Reports of R2 exceeding 1.0 would be recognized as incorrect, but is the real upper limit of sustainably observable R2 really 1.0, or something lower? How can one calculate the highest plausible R2 between dyssynchrony and response? How should we interpret this clinically, and does it affect how we design future research?
The ceiling on R2 depends upon natural variability of dyssynchrony markers and of response markers. Blinded test–retest reproducibility data on mechanical dyssynchrony markers (15,16) and commonly used outcome markers of reverse remodeling are scarce. In this study we collate these and thereby calculate the true upper limit on plausible sustainably observable R2 between dyssynchrony markers and echocardiographic response.
We evaluate the implications for design and interpretation of studies seeking clinically reliable markers of mechanical dyssynchrony in particular, and for studies making claims of individualized prediction of response to any intervention in general.
Quantitative separation of device-mediated, versus spontaneous, changes in left ventricular ejection fraction (LVEF)
Randomized trials of biventricular pacing are the best way to separate spontaneous changes from device-induced changes in cardiac function. Patients undergoing biventricular pacing have 2 drivers of pre-to-post change in the chosen echocardiographic outcome measure (e.g., change in left ventricular ejection fraction [ΔLVEF]). First, inherent phenomena unrelated to biventricular pacing, including true biological variability and measurement error, will contribute to individual patients' ΔLVEF. The variance (square of standard deviation [SD2]) of ΔLVEF in the control patients in a randomized trial measures the size of this inherent scatter between successive measurements over time. Second, the device itself imposes an effect over and above the inherent variation. Because different patients (presumably) gain different amounts of effect from the device, ΔLVEF is more widely spread in the device patients than the control patients. The extra variance in ΔLVEF in the device patients is the variance caused by the device (Fig. 1). Only this extra variance, caused by the device, has any hope of being predicted by baseline dyssynchrony.
Meanwhile, baseline dyssynchrony markers also have inherent variability within a given patient over time. Only test–retest reproducibility studies reveal the extent of this. This is also true of multivariate combination algorithms used to score dyssynchrony because each component contributes something to inherent variability.
When correlating (r) two variables, such as mechanical dyssynchrony and echocardiographic response, or determining the predictive value of one on the other (R2), the measurement variability of both combines to depress the observable relationship strength (Fig. 2) (17). We term this the R2 contraction factor because of the following relationship (Equation 1):

sustainably observable R2 = underlying R2 × R2 contraction factor (1)

where underlying R2 is the potential correlation between the variables if all measurement noise could be eradicated. The Online Appendix shows full details. The R2 contraction factor is also a ceiling on sustainably observable R2 values because the underlying R2 cannot exceed 1.0.
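The attenuation described by Equation 1 can be checked numerically. The following is a minimal Python sketch, not taken from the paper: the signal and noise variances are assumed purely for illustration. Two observed variables share a common true signal, but each carries independent measurement noise equal in variance to the signal, so each contributes a contraction factor of 0.5 and the observed R2 settles near 0.25 despite an underlying R2 of 1.0.

```python
import random

def pearson_r2(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

random.seed(1)
n = 20000
# "True" predictor; the response is perfectly determined by it (underlying R2 = 1).
truth = [random.gauss(0, 1) for _ in range(n)]
# Each variable is observed with independent measurement noise (SD = 1), so only
# half of each observed variance is signal: a contraction factor of 0.5 per variable.
x_obs = [t + random.gauss(0, 1) for t in truth]
y_obs = [t + random.gauss(0, 1) for t in truth]

r2 = pearson_r2(x_obs, y_obs)
print(round(r2, 2))  # close to 0.5 × 0.5 = 0.25, far below the underlying 1.0
```

The observed R2 hovers near the product of the two per-variable contraction factors, exactly as Equation 1 predicts.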
The R2 contraction factor has 2 contributors: 1) contraction from response irreproducibility; and 2) contraction from dyssynchrony irreproducibility. Both are easy to calculate if data are available. Calculating the R2 contraction factor arising from response irreproducibility requires the SD of the Δ in the outcome measure in both the control arm and device arm of a randomized controlled trial. It is not sufficient to know the distribution of the initial and final LVEFs. Rather, the distribution of the change, that is, the SD of Δ, is needed. This can be used in the following calculation (Equation 2):

contraction factor = 1 – (SDΔ, control arm/SDΔ, device arm)2 (2)

A similar formula is used for the mechanical dyssynchrony measure. The 2 contraction factors are then multiplied to determine the combined contraction factor (Equation 3):

combined contraction factor = contraction factor (response) × contraction factor (dyssynchrony) (3)

The MIRACLE-ICD II (Multicenter InSync ICD Randomized Clinical Evaluation II) trial (18) can be used as a worked example. In the control arm, ΔLVEF has an SD of 6.2; in the biventricular pacing arm, ΔLVEF has an SD of 8. Therefore, the contraction factor imposed on R2 by the response marker LVEF is 1 – (6.2/8)2 = 0.40. Thus in populations like those in MIRACLE-ICD II, even with an imaginary perfectly comprehensive and perfectly reproducible dyssynchrony marker, the highest R2 that could be sustainably observed with ΔLVEF would be 0.40.
In reality, mechanical dyssynchrony markers or scores do not have perfect test–retest reproducibility and so impose their own contraction factor. If, for example, the dyssynchrony marker imposed a contraction factor of 0.50, then the combined contraction factor would be 0.40 × 0.50 = 0.20. This means that even if the marker is completely comprehensive in describing all aspects of dyssynchrony (and there are no confounding features, e.g., scar or lead position), the maximum R2 observable is still only 0.20.
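For readers who prefer code to formulas, Equations 2 and 3 reduce to a few lines. This sketch uses the MIRACLE-ICD II ΔLVEF SDs from the worked example above and the illustrative dyssynchrony contraction factor of 0.50:

```python
def contraction_factor(sd_delta_control, sd_delta_device):
    """Equation 2: R2 contraction factor imposed by a response marker,
    computed from the SD of its change in the control and device arms."""
    return 1 - (sd_delta_control / sd_delta_device) ** 2

# Worked example: MIRACLE-ICD II delta-LVEF SDs of 6.2 (control) and 8 (device).
cf_response = contraction_factor(6.2, 8)

# Equation 3: multiply by the (illustrative) dyssynchrony contraction factor of 0.50.
cf_combined = cf_response * 0.50

print(round(cf_response, 2), round(cf_combined, 2))  # prints: 0.4 0.2
```

Any reader can substitute the control-arm and device-arm SDs from a published trial to obtain the ceiling on sustainably observable R2 for that population.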
Data extraction from published studies
A systematic review of studies assessing the response to biventricular pacing was performed using the EMBASE and MEDLINE databases (Fig. 3). The terms cardiac resynchronization therapy, biventricular pacing, and dyssynchrony were used and abstracts reviewed for relevance.
All published studies that assessed mechanical dyssynchrony markers against ΔLVEF, ΔLV end-systolic volume (ESV), and ΔLV end-diastolic volume (EDV) were analyzed and had R2 data extracted (7–9) (Online Appendix References 1–40). R2 was calculated where necessary by squaring the correlation coefficient between the mechanical dyssynchrony marker and outcome measure, using the published data in tabular, text, or graphic form. Weighted averages of the R2 were calculated using the size of the study.
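The size-weighted averaging step can be made concrete with a short sketch. The (R2, n) pairs below are invented solely for illustration and are not taken from the paper's tables:

```python
def size_weighted_mean_r2(studies):
    """Average published R2 values, each weighted by its study size n."""
    total_n = sum(n for _, n in studies)
    return sum(r2 * n for r2, n in studies) / total_n

# Hypothetical (R2, n) pairs: a small positive study is down-weighted
# relative to a larger study reporting a weaker relationship.
studies = [(0.45, 30), (0.10, 120), (0.25, 50)]
print(round(size_weighted_mean_r2(studies), 2))  # prints: 0.19
```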
Studies reporting mechanical dyssynchrony markers were assessed to determine the test–retest variability of the markers and whether data were collected after formal enrollment with blinding. Studies that report SD within 1 patient and the SD across the population can have the R2 contraction factor from the mechanical dyssynchrony marker calculated as 1 – (SDwithin-patient/SDbetween-patient)2. Where test–retest variability is given as a correlation coefficient r, it was used as an estimate of the R2 contraction factor imposed by the dyssynchrony marker.
The landmark externally monitored randomized controlled trials (EM-RCTs) of biventricular pacing were assessed to determine the SD of ΔLVEF, ΔLVESV, and ΔLVEDV in the control and intervention arms (Online Appendix References 41–58). This enabled calculation of contraction factors of these response markers from rigorously performed, formally recruited, externally monitored heart failure populations.
Values are shown as mean (95% confidence interval [CI]), except where otherwise indicated. Comparisons between classes of study were made using the Student unpaired t test and the Mann-Whitney U test. A p value <0.05 was pre-defined as statistically significant. Stata/SE version 10.0 (StataCorp LP, College Station, Texas) was used to perform the statistical analysis.
Reported R2 for echocardiographic response in EM-RCTs and highly skilled single-center studies
Fifty-eight reports were identified and assessed. The majority were retrospective cohort studies with or without a control group, performed in highly skilled single centers (HSSC) with specific interest in echocardiographic dyssynchrony markers and a track record of innovation in the field (Online Appendix references 41–58). The reported R2 in these studies between individual dyssynchrony markers and echocardiographic response to biventricular pacing (ΔLVEF, ΔLVESV, or ΔLVEDV) are tabulated in Table 1.
R2 values were compared according to etiology of heart failure (ischemic heart disease vs. idiopathic); no statistically significant difference was found between the 2 groups (p = 0.38).
EM-RCTs establishing the use of biventricular pacing were assessed. Primary and secondary publications report a wide variety of potential R2 between the outcome of biventricular pacing and baseline measures of dyssynchrony (Table 1) (Online Appendix references 41–58). The reported R2 values found in EM-RCTs were significantly smaller than those found in the HSSC studies (HSSCSs) (p = 0.02) for response in ΔLVEF (0.40 vs. 0.07), ΔLVESV (0.24 vs. 0.06), and ΔLVEDV (0.53 vs. 0.01) (Fig. 4).
R2 contraction factor arising from outcome variable
EM-RCTs provided data sufficient to estimate the R2 contraction factor for the commonly used echocardiographic outcome measures (Table 2) and clinical response markers (Table 3). All 3 echocardiographic outcome variables (ΔLVEF, ΔLVESV, and ΔLVEDV) have sufficient variability in the control populations to give R2 contraction factors that limit observed R2 to modest values, even if the predictive dyssynchrony marker or combination algorithm was perfect and had no variability.
R2 contraction factor arising from the dyssynchrony variable
We assessed the published variability of mechanical dyssynchrony markers between repeated echocardiograms in the same patient (test–retest reproducibility) (Table 4). Only 3 studies report the true test–retest variability needed to calculate the contraction factor for each dyssynchrony marker. In one, within-patient variation was small relative to between-patient variation (15). The second assessed test–retest reproducibility of tissue Doppler imaging mechanical dyssynchrony markers and presented the R2 contraction factor directly when measured by 2 separate readers, giving an average value of 0.35 (16). The third randomized patients to biventricular pacing or medical therapy and reported the change in dyssynchrony indexes remeasured in the control population (70). The mean change and its SD were provided to us by the authors. Overall, the available contraction factors range from 0.16 to 0.92, averaging ∼0.6.
Combined R2 contraction factor
The combined R2 contraction factor between echocardiographic response and a dyssynchrony marker is calculated by multiplication. We estimate that for ΔLVEF it is 0.29 (0.6 × 0.48); for ΔLVEDV, 0.24 (0.6 × 0.40); and for ΔLVESV, 0.30 (0.6 × 0.50). These are point estimates and may overestimate or underestimate the true combined contraction factor. Table 5 displays the likely values with the most likely region in boldface.
Comparing study design between HSSCSs and EM-RCTs
We assessed whether studies specified 3 key design features that limit bias: 1) predictive marker stated to be measured blinded to outcome; 2) outcome marker stated to be measured blinded to the patient's treatment with biventricular pacing; and 3) patients stated to be formally enrolled before the measurements were made (Table 6). The majority of the EM-RCTs report some degree of blinding; almost all of the HSSCSs did not (odds ratio: 70 [95% CI: 6 to 777; p < 0.0001] for response markers; odds ratio: 32 [CI: 3 to 306; p < 0.01] for dyssynchrony markers).
To determine the impact of publication bias, a funnel plot was constructed (Fig. 5). Study size had a weak but statistically significant inverse relationship with reported R2 values (r2 for this relationship = 0.25, p < 0.01), indicating some evidence of publication bias favoring positive studies.
To determine the impact of study design that affected R2, we plotted R2 against study size and the number of bias-resistance features (Fig. 6). Each study scored 0 or 1 point for each of 3 features. When the presence of a feature was unclear, no point was given. While smaller studies showed a tendency to report higher R2 values than larger studies, the number of bias-resistance features showed a stronger influence. Studies designed from the outset to resist bias never showed high R2 values. Very high R2 values occurred exclusively in studies that described little or nothing done to resist bias.
This article shows how to determine the ceiling on the sustainable R2 values between any baseline marker and subsequent response to any intervention, such as biventricular pacing. Readers can use this ceiling, or contraction factor, to judge whether an R2 is credible. The greater the test–retest variability in any predictor and/or response—whether due to biological factors, measurement error, or random noise—the lower the limit on the sustainably observable R2 (20,22).
This approach could be used to screen the plausibility of any claim of a baseline marker apparently predicting the effect of any intervention on any variable.
The ceiling to observed R2
Spontaneous variability in both dyssynchrony markers and response markers conspires to limit observed R2 to only low values. This happens with any etiology of heart failure, and for both single-variable and multivariate risk indexes. HSSCSs consistently report higher R2 values than EM-RCTs, and frequently exceed the mathematical ceiling. This suggests that the HSSCS method is in error.
This study exposes the prerequisites for reliable prediction of individual response, which are challenging. Even a theoretical, perfectly comprehensive dyssynchrony marker (whether a single-variable or multivariate algorithm) that incorporates every facet of responsiveness to biventricular pacing will still have a low ceiling. This is because it will have spontaneous variability, as will the marker of response, and these 2 contraction factors multiply to limit R2.
These findings are important because some studies forget the impact of variability within predictors and response markers, and so have unrealistic expectations. Our study shows that many published studies markedly overestimate the predictive effect of dyssynchrony markers compared with studies having formal enrollment and blinded analysis.
This should not be mistaken as a criticism of workers' integrity, but rather a failing in all of us in underestimating the importance of aspects of study design that might appear superficially uninteresting or trivial. The time-honored approach of hypothesizing correlations and then finding confirmatory evidence in one's local clinical data is incorrect because it provides results that are not only too high but actually above the mathematical ceiling. Strong prediction is sustainable only if both contraction factors are almost 1. Not only are they nowhere near 1, but comparatively little effort has been put into establishing what they are.
Can the contraction factor be improved?
Only genuinely reducing variability in both predictors and response markers can improve the contraction factor. Formal blinded test–retest reproducibility (“other day, other hands, other eyes”) of the markers must be carried out. Methods should then be refined or rejected, and the cycle iterated, until a protocol is obtained that reliably delivers high reproducibility in independent, blinded hands. It is the measurement protocol, and not the operators, that is being tested. If wide test–retest variability is observed, then that is the result of the study design and it should be reported dispassionately. Operators should not be blamed for reporting the truth. Some planners mistake published data on remeasurement for test–retest reproducibility. Others collect it too late to change the study protocol. Worse still, some collect it only under pressure from journal reviewers after study completion, at which stage there is overwhelming pressure to report a narrow variability even if unrepresentative.
Only markers with strong test–retest reproducibility should even be considered for expensive trials of individualized prediction (Table 5). Unless very much narrower than the population distribution of these variables, the markers should be rejected or refined before initiating any major study. Effort expended on maximizing the ratio of signal (between-patient genuine variability) to noise (within-patient variability) is indispensable to improving prediction of response.
Opportunities and limits of replicate averaging
Increasing patient recruitment will not raise the ceiling on sustainably observable R2; instead it enforces the same mathematical ceiling more firmly by reducing scope for fluke associations.
The impact of noise can, however, be reduced by making multiple replicate measures per patient and using the average (19,21). However, signal itself is not increased by replication and so sustainably observable R2 can rise as high as the underlying R2 but no higher.
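The arithmetic behind replicate averaging can be sketched as follows, under the standard assumption that replicate noise is independent between measurements; the signal and noise variances below are illustrative, not measured values. Averaging n replicates shrinks the noise variance by a factor of n while leaving the signal variance untouched, so the contraction factor climbs toward 1, although the observed R2 can never exceed the underlying R2:

```python
def contraction_with_replicates(var_signal, var_noise, n_replicates):
    """Contraction factor for one variable when n independent replicates are
    averaged: noise variance falls as var_noise / n, signal is unchanged."""
    return var_signal / (var_signal + var_noise / n_replicates)

# Illustrative case: equal signal and noise variances (contraction 0.5 at n = 1).
for n in (1, 4, 16):
    print(n, round(contraction_with_replicates(1.0, 1.0, n), 3))
```

Note the diminishing returns: quadrupling the number of replicates recovers progressively less of the remaining gap to 1.0, and the benefit is realized only if the replicates genuinely capture the full spread of the variability.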
When performing replicates for averaging, researchers must avoid the natural temptation to choose replicates that appear similar, ignoring apparent outliers. Measurements are best made without reference to each other, to maximize the statistical advantages of replicate averaging (22,23).
Researchers should also resist the clinically natural temptation to choose the replicate most representative of the patient's full clinical status, because doing so may innocently but powerfully bias the result toward confirming whatever association the researcher believes (24). Clinical practice is heavily dependent on such application of common sense, but unfortunately this is why routinely acquired unblinded clinical measurements are unsuitable for testing whether clinically believed associations are true. The stronger the belief, and the wider the range of values available (22,24), the greater the danger of self-deception.
In practice, replicates are often made a few seconds apart, but this does not capture variability over hours and days, or sensitivity to imperceptible differences in probe position (23). Paradoxically, an average of replicates would be more consistent over time if the measurements aggregated within each average were done on separate days (i.e., their internal variability had been maximized) rather than on successive beats (23). Averaging can only reduce the influence of variability the full spread of which is captured among the data points averaged.
The final advantage of replicate measurement, if “conducted on another day, acquired by other hands, viewed by other eyes,” is that it exposes irreproducible markers for early dismissal.
Why some HSSCSs report higher R2 (and higher than mathematically sustainable limits) than do EM-RCTs
Several factors may have contributed to HSSCSs' reporting significantly higher R2 values than the sustainable values found in EM-RCTs (Fig. 4).
High R2 values may have been found by statistical chance and then published with preferential enthusiasm. This could occur as submission bias from research groups and/or acceptance bias from journals.
Successive HSSCS publications from the same site may have overlapping patient cohorts. Patients might understandably be added to a growing database from which publications naturally arise. High R2 occurring by chance in early cohorts would repeatedly contribute to subsequent publications.
Preferential Recruitment of Patients
Selection of extra patients who have unusually severe or mild mechanical dyssynchrony, or who have unusually large changes in the response variable, will artificially magnify R2.
Lack of Blinding
The R2 between mechanical dyssynchrony markers and response markers can reliably inform prospective clinical practice only if each measurement is performed by observers blinded to the other relevant measurements in that patient. Mechanical dyssynchrony should be measured without knowledge of the LVEF, and vice versa. Dyssynchrony markers are sensitive to adjustment of cursor position, and operators might inadvertently “dial in” (25) the expected dyssynchrony if unblinded. Ventricular function assessment is similarly sensitive to choices during acquisition and during analysis. Clinicians are generally right to preferentially select plausible rather than implausible values. Unfortunately, applying this habit in research is dangerous, because if the clinician already believes the hypothesis, even minor and innocent influence will raise R2 dramatically (24). Concealment of electrocardiography (which shows biventricular pacing spikes) is essential during analysis if unbiased ΔLVEF is sought. The majority of the EM-RCTs report blinding of dyssynchrony and response measurements (Table 6); almost all of the HSSCSs did not.
Selective Inclusion or Exclusion of Particular Patients
HSSCSs may receive unusual referral patterns, distorting the distribution of dyssynchrony markers away from the pattern typically seen in future clinical practice and in EM-RCTs. Finally, HSSCSs, if done without the advantage of formal, sequentially numbered, prospective enrollment of patients, may end up unintentionally analyzing an incomplete subset of the population at that center (24); patients with notably strong concordance between physiological expectation and clinical response are especially unlikely to be forgotten, and their preferential recollection would persistently bias R2 upward.
Why do inflated reports gain circulation, and how can recurrences be prevented?
It is tempting to blame publication bias (i.e., failure to publish negative studies). However, the standard funnel plot (Fig. 5) shows that publication bias is a minority contributor. The overwhelming determinant is the vulnerability of the study design to bias, as shown on the combined design-and-size plot (Fig. 6).
The responsibility may lie more properly with us as an audience for several weaknesses in application of normal scientific critique.
Our community accepted uncritically the term cardiac resynchronization therapy. With repetition it came to seem obvious that quantifying mechanical dyssynchrony (which can only refer to ventricular timings, because atrium and ventricle should not be synchronous) must quantify degree of benefit. Obvious, but not necessarily true. With experimental investigation of the therapy's mechanism of action still at an early stage, we might reduce cognitive distortion by using a neutral term such as biventricular pacing (26).
Physical science audiences judge a scientific finding by the precise nature of the experiment, the attention to detail, and the track record of previous claims being verified by others. Cardiologic audiences may not apply the same level of scrutiny (in particular, bias resistance is rarely debated) and may apply the availability heuristic (judging the credibility of sources from public visibility rather than track record of reliability). Audiences could usefully restore habits from their earlier scientific training.
Hearing each year of novel predictive markers with progressively more excellent predictive capacities, cardiologic audiences forget to ask what happened to markers of years past. If 2 different markers predict excellently, they must agree almost perfectly; if the latter is not the case, the former is not credible. Enhanced audience memory would help resist successive overstatements.
In physical science, any reported efficacious new approach is rapidly tested in small experiments by the audience. Cardiologic audiences may feel unable to do this. Yet simple experiments taking only minutes can quickly reject some claims. One example is the evaluation of blinded test–retest reproducibility and the application of the formulas in this paper. Another is adjustment of interventricular delay across a wide range in a single biventricular pacemaker patient, with blinded measurement of mechanical dyssynchrony; if this does not show a clear minimum in this highly controlled environment, it cannot work across a population (27,28).
We all want our specialty of echocardiography to be relevant. Reports of successful application are therefore intrinsically popular. But this is failure to separate our individual skill as echocardiographers from the ability of an echocardiographic technique to deliver what is claimed. Distinguishing falsity of a hypothesis from personal inadequacy requires courage but is the hallmark of science.
Even when experts carefully review available methods and tabulate that dyssynchrony markers are intensely vulnerable to noise and sometimes choice of measurement location so that there is risk for “dialing in” any desired level of dyssynchrony (25), they may be too polite to explain the quantitative implication for claims of response prediction (29).
We frequently confuse bias (which arises from study design) with chance (which is addressed by p values). Larger study sizes cause even minor systematic biases to become more statistically significant. This confusion afflicts even expert audiences, such as those writing guidelines, who often consider observational studies (if large) to be the same level of evidence (“B”) as a randomized controlled trial. Yet ironically, larger observational study size increases susceptibility to bias, making them less reliable guides to therapeutic decisions.
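The bias-versus-chance distinction can be illustrated numerically. In this sketch (the bias magnitude is assumed purely for illustration), a constant measurement bias yields a z statistic that grows with the square root of the sample size, so a large observational study can make a pure design artifact look overwhelmingly "significant":

```python
import math

def z_for_fixed_bias(bias, sd, n):
    """z statistic for a sample mean offset by a constant measurement bias."""
    return bias / (sd / math.sqrt(n))

# A small constant bias (0.1 SD units) appears ever more "significant" as n
# grows, even though the study design, not the therapy, produced it.
for n in (100, 1000, 10000):
    print(n, round(z_for_fixed_bias(0.1, 1.0, n), 1))
```

The p value shrinks without limit as n rises, yet the bias itself is untouched: enlarging a biased study makes the wrong answer more convincing, not more correct.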
We frequently confuse remeasurement of identical digital images with genuine test–retest reproducibility, entirely ignoring the majority of variability, which occurs between beats. Test–retest variability can be readily checked by clinicians (30).
In science, a finding beyond the bounds of plausibility, such as faster-than-light travel (31), is highlighted as suspicious by the authors even with a p value ∼2 × 10–9. The error was found to be unrecognized measurement bias, and the scientific record corrected (32). Our field could encourage timely retraction of clinical reports that are discovered to be unrepresentative, giving credit to authors who report what went wrong in their own studies.
Clinical implications for mechanical dyssynchrony
This paper provides a simple method for clinicians to calculate the ceiling on plausible claims of predictability of ventricular response. It may seem surprising that variability in repeated measures can matter so much because such variability does not seem to impede normal clinical practice or trials addressing group mean effects. However, even small variability inevitably prevents accurate individualized prediction of response.
This approach may seem somewhat mathematical. However, publications stating an R2 or correlation are making a mathematical assertion of association strength. This paper shows how the same mathematics that underlie R2 calculations also demarcate the upper limit of plausibility, which is far below 1.0, exposing some assertions as anomalous.
Some may suspect that if ischemic scar or imperfect left ventricular lead positioning could be excluded, mechanical dyssynchrony markers might provide good prediction. These factors make prediction more difficult, that is, the ceiling is even lower than we describe here. Our calculations show that even if scarring, mispositioning, and all other confounders could be eliminated, the highest sustainable R2 value would still be low.
Nor can multivariable prediction by composite markers evade these difficulties. Spontaneous variability in response markers remains and may be worse for composites of poorly reproducible components. Thus composite markers will likely have an even lower ceiling on R2.
The search for predictive markers stems from a desire to optimize the resource cost of biventricular pacing. However, resources expended in identifying predictors unreliably would be better expended first screening predictors and response markers for blinded test–retest reproducibility. Early exposure of poor reproducibility would forestall reports of prediction that are destined not to stand the test of time.
Implications for research into mechanical dyssynchrony
Our analysis rejects not the concept of mechanical dyssynchrony, but rather the value of unblinded, informally enrolled studies of prediction (Fig. 5). Outcomes are known to be better in those device recipients who have greater dyssynchrony but observational study design cannot distinguish whether the better outcome would have happened without the device, or occurred as a result of the device. The continuous-variable analysis of the CARE-HF (Cardiac Resynchronization in Heart Failure) randomized controlled trial shows both mechanisms occurring simultaneously (33). Nondevice patients had progressively better outcomes the more dyssynchrony they had at baseline (33).
Approaches that have worked for addressing group mean effects cannot be uncritically expected to determine which individual patients benefit the most from biventricular pacing. This would need measurements of the effect of biventricular pacing on individual patients that have narrow within-individual error bars. Symptoms or outcomes assessed in the conventional way are not suitable, but quantitative physiological measurements could be developed to deliver this (33–35).
Wider implications for cardiologic research
Study design can overwhelmingly determine study results. In this example, a “perfect storm” of factors has made the literature as a whole unreliable: excellent plausibility, a clear survival benefit from the devices, clinical enthusiasm, unnoticed poor test–retest variability, the underestimated impact of unblinded measurement, and a lack of community awareness of ceilings on predictability.
But similar overstatements may be occurring elsewhere, unnoticed. Our cardiologic community should improve its ability to challenge claims. We must recognize that study size alone does not guarantee reliability (36); bias must be actively removed by careful planning. We should be emboldened to question early pioneering work, because history shows it is often discredited later (37–39). Negative studies may seem superficially unexciting to journals, but carefully designed, reader-replicable studies that contradict prevailing beliefs are the lifeblood of genuine science. We should prize not extreme claims but reliable experiments that readers can check. When results are incompatible, groups should collaborate to understand why.
Few real-life outcomes are overwhelmingly determined by 1 variable; almost always, multiple features (including those that cannot currently be quantified) matter and interact in a way that cannot be captured in a single diagnostic marker. Therefore, we should react with surprise if a single variable is reported to predict any outcome with high certainty.
Finally, this recent “bubble market” in mechanical dyssynchrony should inoculate our community with skepticism for claims of association strength and encourage the examination of the track record of whether previous claims have stood the test of time.
This study is limited by the haste with which reports of prediction of biventricular pacing response arrive in the literature before independent blinded evaluations of the test–retest reproducibility of their methods. No study appears to have performed a series of blinded measurements of dyssynchrony that would permit the true reproducibility SD to be evaluated. We have had to use the SD of the difference between just two measurements, which is an imperfect estimate.
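This limitation can be made concrete. A hedged sketch, assuming only one pair of blinded measurements per patient is available (as in the studies reviewed): the within-subject reproducibility SD is then estimated from the SD of the paired differences divided by √2, because each difference contains the measurement error of both readings. The duplicate dyssynchrony values below are hypothetical, purely for illustration:

```python
import statistics
from math import sqrt

def reproducibility_sd(first, second):
    """Estimate the within-subject (test-retest) SD from one pair of
    measurements per patient: SD(differences) / sqrt(2). With only two
    measurements per patient this remains an imperfect estimate."""
    diffs = [a - b for a, b in zip(first, second)]
    return statistics.stdev(diffs) / sqrt(2)

# Hypothetical duplicate dyssynchrony measurements (ms) in 8 patients:
first  = [62, 45, 80, 30, 55, 70, 40, 65]
second = [58, 50, 72, 38, 60, 64, 47, 61]
print(round(reproducibility_sd(first, second), 1))  # 4.6
```

A series of more than two blinded measurements per patient would allow the reproducibility SD to be estimated directly, without this approximation.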
Furthermore, few reports of claimed prediction of response present sufficient information to determine the distribution of the change induced by biventricular pacing, or the test–retest reproducibility of either predictors or outcome measures. We have used rigorously performed EM-RCTs, whose control populations allow the inherent variability of response markers to be assessed over the time periods over which response to device implantation is normally measured. However, these control populations are only similar (through randomization) to their corresponding intervention populations, rather than identical. Therefore, the estimates of the R2 contraction factor may slightly under- or overestimate the true value. Individual studies may even give an estimate <0, especially if the true contraction factor is near zero and/or the Δ values are non-normally distributed, as might occur when an analysis is unblinded.
These studies had different types of patients with differing etiologies, echocardiographic criteria, and outcome measures. This would limit the strength of a conventional meta-analysis, but it increases the generalizability of the findings from our study. No marker was strongly predictive when tested in bias-resistant designs, despite covering many spectra of patient populations. There was no relationship of the R2 values to the proportion of patients who had ischemic etiology of heart failure.
Many different mechanical dyssynchrony markers have been assessed, some basic and some sophisticated. They are not all directly comparable. For openness and completeness, we fully report all of the markers for which R2 ceiling was calculable (Table 1). Most have only modest values, including the newer, more sophisticated ones. Markers involving multiple steps to measure have more sources of variation and are likely to have a worse R2 contraction factor.
Some recent studies report not R2 but the area under the curve or a sensitivity analysis. The underlying variability remains, however, and an equivalent of the R2 contraction factor affects all measures of association.
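That other association measures are attenuated in the same way can be checked by simulation. A sketch under stated assumptions (Gaussian marker values, responders defined by a noiseless threshold on the true marker, area under the ROC curve computed by the rank-based Mann–Whitney formula): adding test–retest noise to the measured marker lowers its AUC just as it lowers R2.

```python
import random

def auc(scores_pos, scores_neg):
    """Mann-Whitney estimate of the area under the ROC curve:
    the probability that a responder scores above a nonresponder."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

random.seed(1)
true_marker = [random.gauss(0, 1) for _ in range(800)]
responder = [m > 0 for m in true_marker]  # noiseless "true" response

results = {}
for noise_sd in (0.0, 1.0, 2.0):
    # The observed marker is the true marker plus test-retest noise.
    observed = [m + random.gauss(0, noise_sd) for m in true_marker]
    pos = [o for o, r in zip(observed, responder) if r]
    neg = [o for o, r in zip(observed, responder) if not r]
    results[noise_sd] = auc(pos, neg)
    print(noise_sd, round(results[noise_sd], 2))
```

With zero noise the AUC is a perfect 1.0 by construction; as the noise SD grows, the AUC shrinks toward 0.5, mirroring the R2 contraction factor.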
Newer studies have chosen alternative echocardiographic response measures, such as global strain. We specifically chose to review the commonly used echocardiographic response markers (LVEF, LVESV, and LVEDV). This is because, first, these markers are routinely measured by echocardiographic laboratories. Second, they are the most familiar to clinicians. Third, improvement in them is accepted to be related to outcome (40). Finally, they have been tested in formally enrolled EM-RCT settings, allowing reliable assessment of spontaneous variability. Our findings, however, are applicable to all future predictors and response markers. Each must have low test–retest variability for the contraction factor to be large enough to permit any chance of clinically impressive prediction of response.
No scientifically conducted study will give reliable individual patient prediction (e.g., R2 > 0.5), by any current baseline marker of mechanical dyssynchrony, of any current marker of response to biventricular pacing, across a representative range of patients, with current measurement protocols. Unsustainably high R2 values arise through honest efforts in HSSCSs not because the studies are small but because they unknowingly have inherently unreliable designs.
It may be time to critically reassess the utility of HSSCS literature on the prediction of ventricular response to mechanical dyssynchrony. The overstatement of relationship strength by 5- to 20-fold, and the large proportion of results exceeding the mathematically possible ceiling, indicate that unblinded studies without formal enrollment have failed us entirely.
It is not credible to attempt to usefully predict ventricular response in individual patients, or to embark on further research into such predictive power, unless 2 substantial methodologic advances arrive: 1) protocols for measuring mechanical dyssynchrony that reliably give high test–retest reproducibility in the hands of multiple centers beyond their originators, under formal blinded conditions; and 2) methods for quantifying “response” in a way that, in patients who undergo no intervention, shows minimal within-individual change over time, in blinded externally monitored analysis, over time periods similar to those over which biventricular pacing response is typically measured.
The latter may be biologically impossible, in which case reliable individualized prediction of response is impossible.
To prevent future bubble markets of ineffective diagnostics, we should not take seriously any more unblinded, unenrolled studies that make mathematically implausible claims.
For appendix, please see the online version of this article.
The Limit of Plausibility For Predictors of Response: Application to Biventricular Pacing (Cardiac Resynchronization Therapy), Systematic Review and Design Steps For Reliable Research
Research by Dr. Nijjer is supported by the Medical Research Council (grant no. G110043). Research by Drs. Francis and Pabari is supported by the British Heart Foundation (grant nos. FS/10/038 and PG/08/114, respectively). Dr. Leyva has received financial support for research from Medtronic, St. Jude Medical, Sorin, and Boston Scientific. Dr. Linde is a consultant to Medtronic and CARDIOMEMS, is the principal investigator for the REVERSE trial, and sits on the Data Safety and Monitoring Board of the Biopace study sponsored by St. Jude. Dr. Freemantle sits on Data Safety and Monitoring Boards for Medtronic trials and has received consultancy and travel fees for this work. All other authors have reported that they have no relationships relevant to the contents of this paper to disclose.
- Abbreviations and Acronyms
- CI = confidence interval
- EM-RCTs = externally monitored randomized controlled trials
- HSSCSs = highly skilled single-center studies
- LVEDV = left ventricular end-diastolic volume
- LVEF = left ventricular ejection fraction
- LVESV = left ventricular end-systolic volume
- SD = standard deviation
- American College of Cardiology Foundation
References
- Young J.B., Abraham W.T., Smith A.L., et al., Multicenter InSync ICD Randomized Clinical Evaluation (MIRACLE ICD) Trial Investigators
- Bristow M.R., Saxon L.A., Boehmer J., et al., Comparison of Medical Therapy, Pacing, and Defibrillation in Heart Failure (COMPANION) Investigators
- Yu C.M., Fung J.W., Zhang Q., et al.
- Bleeker G.B., Mollema S.A., Holman E.R., et al.
- Marcus G.M., Rose E., Viloria E.M., et al., VENTAK CHF/CONTAK-CD Biventricular Pacing Study Investigators
- Chung E.S., Leon A.R., Tavazzi L., Sun J.P., et al.
- Miyazaki C., Redfield M.M., Powell B.D., et al.
- Hawkins N.M., Petrie M.C., Burgess M.I., McMurray J.J.
- National Institute for Health and Clinical Excellence
- Gustafson P.
- St John Sutton M.G., Plappert T., Abraham W.T., et al., Multicenter InSync Randomized Clinical Evaluation (MIRACLE) Study Group
- Hutcheon J.A., Chiolero A., Hanley J.A.
- Pabari P.A., Willson K., Stegemann B., et al.
- Francis D.P.
- Francis D.P.
- Abraham T.P., Dimaano V.L., Liang H.Y.
- Kyriacou A., Pabari P.A., Francis D.P.
- Pabari P.A., Kyriacou A., Moraldo M., et al.
- Lumens J., Leenders G.E., Cramer M.J., et al.
- MacRoberts M.H., MacRoberts B.R.
- Finegold J.A., Manisty C.H., Cecaro F., Sutaria N., Mayet J., Francis D.P.
- The Opera Collaboration
- Reich E.S.
- Richardson M., Freemantle N., Calvert M.J., Cleland J.G., Tavazzi L., CARE-HF Study Steering Committee and Investigators
- Stegemann B., Francis D.P.
- Bogaard M.D., Meine M., Tuineburg A.E., Maskara B., Loh P., Doevendans P.A.
- Parolari A., Tremoli E., Cavallotti L., et al.
- Hudes M.L., McCann J.C., Ames B.N.
- Yu C.M., Bleeker G.B., Fung J.W., et al.
- Gumbel E.J.