The Effect of Cluster Randomization on Sample Size in Prevention Research

Article Type

Changed

Mon, 01/14/2019 - 11:04

Display Headline

Author(s)

BACKGROUND: This paper concerns the issue of cluster randomization in primary care practice intervention trials. We present information on the cluster effect of measuring the performance of various preventive maneuvers between groups of physicians based on a successful trial. We discuss the intracluster correlation coefficient of determining the required sample size and the implications for designing randomized controlled trials where groups of subjects (eg, physicians in a group practice) are allocated at random.

METHODS: We performed a cross-sectional study involving data from 46 participating practices with 106 physicians collected using self-administered questionnaires and a chart audit of 100 randomly selected charts per practice. The population was health service organizations (HSOs) located in Southern Ontario. We analyzed performance data for 13 preventive maneuvers determined by chart review and used analysis of variance to determine the intraclass correlation coefficient. An index of “up-to-datedness” was computed for each physician and practice as the number of a recommended preventive measures done divided by the number of eligible patients. An index called “inappropriatness” was computed in the same manner for the not-recommended measures. The intraclass correlation coefficients for the 2 key study outcomes (up-to-datedness and inappropriateness) were also calculated and compared.

RESULTS: The mean up-to-datedness score for the practices was 53.5% (95% confidence interval [CI], 51.0%-56.0%), and the mean inappropriateness score was 21.5% (95% CI, 18.1%-24.9%). The intraclass correlation for up-to-datedness was 0.0365 compared with inappropriateness at 0.1790. The intraclass correlation for preventive maneuvers ranged from 0.005 for blood pressure measurement to 0.66 for chest radiographs of smokers, and as a consequence required that the sample size ranged from 20 to 42 physicians per group.

CONCLUSIONS: Randomizing by practice clusters and analyzing at the level of the physician has important implications for sample size requirements. Larger intraclass correlations indicate interdependence among the physicians within a cluster; as a consequence, variability within clusters is reduced and the required sample size increased. The key finding that many potential outcome measures perform differently in terms of the intracluster correlation reinforces the need for researchers to carefully consider the selection of outcome measures and adjust sample sizes accordingly when the unit of analysis and randomization are not the same.

In conducting research with community-based primary care practices it is often not feasible to randomize individual physicians to the treatment conditions. This is due to problems of potential contamination between intervention and control subjects in the same practice setting or because the success of the intervention demands that all physicians in the practice setting adhere to a guideline. As a result, the practice itself is randomized to the conditions.

The randomization of physicians in groups, rather than each individual separately, has important consequences for sample size, interpretation, and analysis.^1-3 It is argued that groups of physicians are likely to be heterogeneous,⁴ giving rise to a component of variation that one must take into account in the analysis and that one can control only by studying many groups of physicians rather than many physicians within each group.⁴

Randomizing physicians by cluster and then analyzing the data by physician or patient has the potential to introduce possible bias in the results. It has been noted that many studies randomized groups of health professionals (cluster randomization) but analyzed the results by physician, thus resulting in a possible overestimation of the significance of the observed effects (unit of analysis error).⁵ Divine and colleagues⁶ observed that 38 out of 54 studies of physicians’ patient care practices had not appropriately accounted for the clustered nature of the study data. Similarly, Simpson and coworkers⁷ found that only 4 out of 21 primary prevention trials included sample size calculations or discussions of power that allowed for clustering, while 12 out of 21 took clustering into account in the statistical analysis. When the effect size of the intervention is small to moderate, analyzing results by individual without adjusting for the cluster phenomena can lead to false conclusions about the significance of the effectiveness of the intervention. For example, Donner and Klar⁸ show that for the data of Murray and colleagues⁹ the P value would be .03 if the effect of clustering were ignored, while it was greater than .1 after adjusting for the effect of clustering.

Using baseline data from a successful randomized controlled trial of primary care practices in Southern Ontario, Canada,¹⁰ we will explain the intracluster correlation coefficient (ICC) in determining the required sample size of physicians. The ICC is a measure of variation within and between clusters of physicians. It is a measure of the clustering effect or the lack of independence among the physicians that make up the cluster. The smaller the ICC, the more likely the physicians in the cluster behave independently, and analysis at the level of the physician can proceed without significant adjustment to sample size. The higher the ICC, the more closely the measure quantifies class or group rather than the individual physician, and the effective sample size is decreased to the number of classes rather than the number of individuals. Our objective was to provide information on the cluster effect of measuring the performance of various preventive maneuvers between groups of physicians to enable other researchers in the area of primary care prevention to avoid errors.

Methods

As part of a larger clinical trial to improve preventive practice, we conducted a cross-sectional study to provide a point estimate of preventive performance in capitation primary care practices. We chose the preventive maneuvers from the Canadian Task Force on the Periodic Health Examination.¹¹ According to their classification system, there is randomized clinical trial evidence to support “A” level (highly recommended) maneuvers and cohort and case controlled studies to support “B” (recommended) maneuvers. The task force also reviewed the quality of evidence for maneuvers that should not be done and identified these as “D” level maneuvers. Eight A and B level recommendations and 5 D level recommendations were identified by a panel of practicing family physicians. Selection criteria included the need to represent a broad spectrum of preventive interventions for both men and women patients of all ages, and the need to address diseases that were clinically important. The 8 recommended and 5 inappropriate maneuvers chosen for our study are listed in Table 1.

This study was conducted in 72 community-based health service organizations (HSOs) in Ontario located at 100 different sites primarily in the Toronto, Hamilton, and London areas in the spring of 1997. The Ottawa Civic Hospital research ethics committee approved our study.

Data Collection

Practice and physician characteristics were collected using a self-administered questionnaire to which 96% of 108 participating physicians responded (Table 2 has the questionnaire items). Preventive performance at the physician and overall practice level was determined using a chart audit.

Chart Audit. Patient charts were eligible for inclusion in the medical audit if they were for patients who were aged 17 years or older on the date of last visit and had visited the HSO at least once in the 2 years before the audit. The variables collected from the charts included demographic and patient characteristics as well as indicators of performance of preventive maneuvers.

The chart auditors obtained a list of patients within an HSO practice group of physicians and then randomly selected charts using computer-generated random numbers. The patient list was either constructed by the auditors or by using the medical office record computer system. The list included all rostered and nonrostered patients. Unique chart numbers or numeric identifiers were assigned to each patient. The required number of charts was randomly selected from the sampling frame, the chart was pulled, and eligibility for inclusion was determined. The auditors proceeded to find charts at random from the sampling frame until they obtained 100 eligible charts per practice.

To verify the quality of the data entered from the 100 randomly selected charts and to measure the inter-rater reliability between auditors, 20% of each HSO’s audited charts were independently verified by another auditor. If coding discrepancies were found in more than 5 of 20 charts, the entire 100 charts were audited and verified again.

Data Analysis. Our analysis with SPSS software version 8.0 (SPSS Inc, Chicago, Ill) focused primarily on calculating the extent to which each preventive maneuver was being performed according to the recommendations of the Canadian Task Force on the Periodic Health Exam. An index of “up-to-datedness” was computed for each physician and practice as the number of A and B preventive measures done divided by the number of eligible A and B measures. In addition, an index called “inappropriateness” was computed in the same manner to represent the D measures.

Frequencies and descriptive statistics were generated on all variables, and each variable was checked for data entry errors and inappropriate or illogical responses. Means and standard deviations were computed for continuous variables and frequency distributions were computed for categorical variables, such as sex and age group. In addition, chi-square tests were used to compare the background characteristics of participating and nonparticipating HSO physicians. Ninety-five percent confidence intervals were calculated for the mean preventive indexes. Finally, kwas computed as a measure of reliability between the 2 chart auditors.

The ICC was calculated for sample cluster means¹² of up-to-datedness and inappropriateness. The practice characteristic data revealed that the mean cluster size in terms of number of physicians per practice was 2.8 with a variance of 3.6 and a total of 106 physicians across 46 practices. To determine the between-subjects (practices) variance (s_b²) and within-subjects (practices) variance (s_w 2) for the ICC calculation, the one-way analysis of variance (ANOVA) procedure was run on both measures (up-to-datedness and inappropriateness) as well as each of the preventive maneuvers separately.¹³ ICC (D) was computed from the F statistics of the one-way ANOVA and the adjusted cluster size as follows: D = F-1/F+n₀-1 where n₀ is the mean practice size ([2.8] - [practice variance (3.6)/106 physicians]).

Finally, using the formula by Donner and coworkers¹⁴ and the ICC, the sample size for comparing 2 independent groups allowing for clustering for both up-to-datedness and inappropriateness was determined. The formula is: n = 2(Z₋_/2+Z₀₂)²F²[1+(x_c-1)D]/*² where n is the per group sample size; Z_−/2= 1.96 and Z₀₂= 0.84 are the standard normal percentiles for the type I and type II error rates at 0.05 and 0.20, respectively; Fis the standard deviation in the outcome variable; * is the expected difference between the two means; x_cis the average cluster size; and D is the ICC. The sample size calculations were based on an expected difference of 0.09 between groups (with 80% power and 5% significance) and a standard deviation of 0.10. Table 3 shows the effect on per-group sample size for varying ICC values and average cluster sizes.

Results

A total of 46 HSOs were recruited out of a possible 100 sites, for a response rate of 46% at baseline. The response rate to the physician questionnaire was 98% (106 of 108). Physicians in practices that agreed to participate differed significantly from those who did not. Participating physicians were younger, having graduated in 1977 on average compared with 1971 (t=4.58 [df=191], P<.001) and were more likely to be women, 30.4% compared with 9.9% for nonparticipating physicians (c²=11.09 [df=1, N=193], P=.001). Table 2 provides descriptive information on practice and physician characteristics. Five practices of 46 needed to have the entire 100 charts re-audited. Final concordance between the 2 auditors for each practice verification was 85% (k=.71).

The mean up-to-datedness score for the practices or the mean proportion of A and B maneuvers performed was 53.5% (95% confidence interval [CI], 51.0%-56.0%) and the mean inappropriateness score was 21.5% (95% CI, 18.1%-24.9%). In other words, on average, 53.5% of patients eligible for recommended preventive maneuvers received them and 21.5% of eligible patients received inappropriate preventive maneuvers.

Table 1 gives the practice mean square, the error mean square, the ICC, and the required sample size per group for the overall measures of up-to-datedness and inappropriateness as well as for 13 preventive maneuvers individually. For inappropriateness, there was more variability between practices than within practices among physicians, resulting in a larger practice mean square and a significant F statistic (P <.05). For up-to-datedness, the variability within practices among physicians was greater than the variability between practices, although not significantly so. Table 1 shows the intraclass correlation as 0.0365 for up-to-datedness and 0.1790 for inappropriateness. Inappropriateness scores were not normally distributed, and 2 physicians had scores greater than 0.60. However, with these extreme outliers removed, the ICC for inappropriateness remained high at 0.14.

The ICC ranges from 0.005 for blood pressure measurement to 0.66 for chest x-rays of smokers. The variability between and within group clusters is the same for blood pressure measurement. For chest x-rays of smokers the variability between clusters is very significant and within clusters it is small, indicating that some practice clusters perform a larger number of chest x-rays on smokers than other practices. However, the performance of chest x-rays was not normally distributed, with 79% of physicians not performing them and one solo physician with an extreme score of 0.53. With this extreme outlier removed the ICC for chest x-rays was 0.25, with a mean square between practices of 0.0024 and a mean square within practices of 0.0012 (P<.01). Table 1 shows the effect on sample size for analysis at the level of the physician as the ICC varies.

Discussion

Statistical theory points to the consequences of cluster randomization as a reduction in effective sample size. This occurs because the individuals within a cluster cannot be regarded as independent. The precise effect of cluster randomization on sample size requirements depends on both the size of the cluster and the degree of within-cluster dependence as measured by ICC.² Cluster randomized trials are increasingly being used in health services research particularly for evaluating interventions involving organizational changes when it is not feasible to randomize at the level of the individual. Cluster randomization at the level of the practice minimizes the potential for contamination between treatment and control groups. However, the statistical power of a cluster randomized trial when the unit of randomization is the practice and the unit of analysis is the health professional can be greatly reduced in comparison to an individually randomized trial.¹⁵

To preserve power the researcher should, whenever possible, ensure that the unit of randomization and the unit of analysis are the same.¹⁶ In this manner standard statistical tests can be used. Often this is not possible given secondary research questions that may be targeted to the health professionals within the practice and not the practice as a whole. If data are analyzed at the level of the individual and not at the level of the cluster (in effect ignoring the clustering effect), then there is a strong possibility that P values will be artificially extreme and confidence intervals will be overly narrow, increasing the chances of spuriously significant findings and misleading conclusions.¹⁵ When using the individual physician as the unit of analysis, one must take into account the correlation between responses of individuals within the same cluster. For continuous outcome variables that are normally distributed, a mixed-effects analysis of variance (or covariance) is appropriate, with clusters nested within the comparison groups.¹⁷ For dichotomous variables, Donner and Klar suggest that an adjusted chi-square test be used.⁸ Although we focus on the issue of clustering for study designs using random allocation, the issue of clustering is also apparent in cross-sectional and cohort studies, where the practice-level and/or physician-level factors may have an impact on patient-level data. Researchers need to be aware of the possibility of intracluster correlation and the implications for analysis in these studies as well.¹⁸

In the example presented, the ICC for the outcome measure “up-to-datedness” was approximately 0.04 in contrast to the ICC for inappropriateness, which was 0.18. The required sample size per group for the outcome measure “up-to-datedness” would be 21 physicians compared with inappropriateness, where the sample size would be 25 per group. In contrast, if the study dealt with improving smoking cessation counseling or reducing chest x-rays in smokers, the sample size would be 27 or 42 physicians per group. Treating the unit of analysis and the unit of randomization the same would require only 19 physicians per group.

Campbell and colleagues¹⁹ looked at a number of primary and secondary care study data sets and found that ICCs for measures in primary care were generally between 0.05 and 0.15. In contrast, in this study the ICCs ranged from 0.005 to 0.66, depending on the measure. The difference in ICC between measures and across studies is interesting, and we can only speculate why some measures show more interdependence. It is possible that inappropriateness taps phenomena such as policies at the practice level which physicians can not easily influence, while up-to-datedness may help explain how physicians even when working in the same practice setting behave independently when it comes to delivering recommended preventive care. It is important to be aware and not to assume that because one measure may show independence that all measures under study show the same independence. For example, blood pressure measurement and urine proteinuria screening are different in terms of ICC. Differences between outcome measures should be taken into account when calculating required sample size and in statistical analysis when the unit of randomization and analysis are not the same.

Limitations

There are 2 limitations with this research. First, analysis of respondents and nonrespondents to the recruitment effort showed that the study participants were more likely to be younger and women. This would imply that our findings may not be generalizable to the HSO population as a whole. Second, the measures of preventive performance were based on a chart audit and as a consequence are susceptible to the potential problems associated with chart documentation. A low level of preventive performance does not necessarily mean that prevention is not being practiced or that it is being performed inconsistently within a group practice. It may indicate that a less sophisticated documentation process is being used.

Conclusion

Physicians clustered together in the same practice do not necessarily perform the delivery of preventive services equally. As demonstrated by the measure “up-to-datedness,” there is relatively little correlation among physicians working together for performance of many preventive maneuvers. For some maneuvers, most notably those that may be automatically performed as part of practice policy, there is modest correlation among physicians who work together. We hope that these findings assist other researchers in their decision making around the need to adjust sample sizes for the effect of clustering.

References

1. Cornfield J. Randomization by group: a formal analysis. Am J Epidemiol 1978;108:100-02.

2. Donner A. An empirical study of cluster randomization. Int J Epidemiol 1982;11:283-86.

3. Kerry SM, Bland JM. The intercluster correlation coefficient in cluster randomisation. BMJ 1998;316:1455.

4. Gail MH, Mark SD, Carrol RJ, Green SB, Pee D. On design considerations and randomization based inference for community intervention trials. Stat Med 1996;15:1069-92.

5. Bero LA, Grilli R, Grimshaw JM, Harvey E, Oxman AD, Thomson MA. Closing the gap between research and practice: an overview of systematic reviews of interventions to promote the implementation of research findings: the Cochrane Effective Practice and Organization of Care Review Group. BMJ 1998;317:465-86.

6. Divine GW, Brown JT, Frazier LM. The unit of analysis error in studies about physicians’ patient care behavior. J Gen Intern Med 1992;7:623-29.

7. Simpson JM, Klar N, Donner A. Accounting for cluster randomization: a review of primary prevention trials, 1990 through 1993. Am J Public Health 1995;85:1378-83.

8. Donner A, Klar N. Methods for comparing event rates in intervention studies when the unit of allocation is the cluster. Am J Epidemiol 1994;140:279-89.

9. Murray DM, Perry CL, Griffen G, et al. Results from a statewide approach to adolescent tobacco use prevention. Prev Med 1992;21:449-72.

10. Lemelin J, Hogg W, Baskerville B. Evidence to action: a tailored multi-faceted approach to changing family physician practice patterns and improving preventive care. CMAJ. In press.

11. Canadian Task Force on the Periodic Health Examination The Canadian guide to clinical preventive health care. Ottawa, Canada: Health Canada; 1994.

12. Fleiss JL. Statistical methods for rates and proportions. 2nd ed. New York, NY: John Wiley & Sons; 1981.

13. Bland JM, Altman DG. Measurement error and correlation coefficients. BMJ 1996;313:41-42.

14. Donner A, Birkett N, Buck C. Randomization by cluster: sample size requirements and analysis. Am J Epidemiol 1981;114:906-14.

15. Campbell MK, Grimshaw JM. Cluster randomised trials: time for improvement. BMJ 1998;317:1171-72.

16. Bland JM, Kerry SM. Trials randomised in clusters. BMJ 1997;315:600.

17. Koepsell TD, Martin DC, Diehr PH, et al. Data analysis and sample size issues in evaluations of community based health promotion and disease prevention programs: a mixed-model analysis of variance approach. Am J Public Health 1995;85:1378-83.

18. Feldman HA, McKinlay SM. Cohort versus cross-sectional design in large field trials: precision, sample size, and a unifying model. Stat Med 1994;13:61-78.

19. Campbell M, Grimshaw J, Steen N. Sample size calculations for cluster randomised trials. J Health Serv Res Policy 2000;5:12-16.

Author and Disclosure Information

BRUCE N. BASKERVILLE, MHA
WILLIAM HOGG, MD
JACQUES LEMELIN, MD
Ottawa, Ontario, Canada
Submitted, revised, December 12, 2000.
From the Department of Family Medicine, University of Ottawa. Reprint requests should be addressed to N. Bruce Baskerville, BA, MHA, Department of Family Medicine, University of Ottawa, 210 Melrose Ave, Ottawa, Ontario, Canada K1Y 4K7. E-mail: [email protected].

Issue

The Journal of Family Practice - 50(03)

Publications

The Journal of Family Practice

MDedge Family Medicine

Page Number

242

Legacy Keywords

,Cluster randomization [non-MESH]statistical sampling [non-MESH]preventive health servicesprimary health carerandomized controlled trials. (J Fam Pract 2001; 50:W241-W246)

Sections

Original Research

Author(s)

Bruce N. Baskerville, MHA