Calculating reliability: Methodology adapted for estimating stability of cost data in New York

February 1st, 2017 / By Michael Keyes, Ryan Butterfield, DrPH, MBA

Achieving New York’s goal of moving 80-90 percent of managed Medicaid dollars to value-based payment (VBP) arrangements by 2020 will certainly be a challenge. Most MCOs will need to reach deep into their networks and contract with PCP groups that maintain relatively small panels. Based on analysis of the clients with which 3M works, an average MCO can expect to see 30 percent of all medical expenses attributed to independent PCP groups with fewer than 1,000 assigned members. Nearly all of these groups will need to be contracted in a VBP arrangement for the plan to meet the state’s VBP goals. As a result, we’ve received many questions about what panel sizes will support risk-adjusted VBP contracts. To that end, we’ve engaged our research department to develop a recommendation based on rigorous analysis.

Goal of this Analysis and Discussion
There is a need in the statistical literature for applied univariate methods that capture the stability of a measurement. Reliability is just such a measure. The reliability statistic described by Adams et al. (2010) was adapted here to answer the following questions: At what panel size do measurements related to total cost of care and medical allowed per member per month (PMPM) stabilize enough for use in creating projections? Is there a range of panel sizes that show stability and could give clients options as they develop VBP programs? These and related questions will be discussed in this blog series.

General Overview
While there are many techniques for calculating power and sample size, more work needs to focus on including effect size, coefficient value, or confidence interval range as part of the traditional statistical canon. But how do you measure the accuracy of those calculations? This is commonly done through repetition and reliability. The statistical literature offers several approaches for capturing this kind of information. It can be done at the experimental design stage with something like a Gauge R&R design, or computationally using Cronbach’s alpha or a similar estimate. The advantage of the reliability estimate presented by Adams and colleagues at RAND, and adapted here by 3M HIS, is that the hierarchical structure inherently present in payer/provider data can be built into the model, which then produces the variance components needed for the ratio. This gives the calculation a flexibility not commonly present in other approaches.

Adams et al. described reliability as “the ratio of signal to noise,” or, in technical phrasing, the share of total variation that comes from real differences between components (the signal) rather than from variation within a component (the noise). In essence it serves as a QA check on the data being used: How well is the ratio of variances behaving? The closer the ratio is to one, the more reliable the data; the closer it is to zero, the less reliable.
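In symbols, the signal-to-noise description above is usually written in the provider-profiling literature (for example, the Adams 2009 tutorial listed in the references) along the following lines. The subscript names are labels chosen here for illustration, not notation taken from this post:

```latex
\[
  \text{reliability}
  = \frac{\sigma^2_{\text{between}}}
         {\sigma^2_{\text{between}} + \sigma^2_{\text{error}}}
\]
```

Here \sigma^2_{\text{between}} is the variance of true differences between providers and \sigma^2_{\text{error}} is the error variance of a single provider's measured score. When the between-provider variance dominates, the ratio approaches 1; when the error variance dominates, it approaches 0.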

Statistical Methodology
We developed the reliability estimate as a ratio of variance components. These components are derived from the variance estimates found in a multi-level random effects hierarchical general linear model. The outcomes were continuous and found to be a fairly close approximation to a log-normal distribution after risk adjustment using the 3M CRG, age, and sex weighting system.
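To make the idea of variance components concrete, here is a minimal Python sketch. It is not the multi-level hierarchical model described above: it swaps in a simpler one-way, method-of-moments (ANOVA) estimator with a single grouping level, and it runs on simulated log-scale costs rather than real risk-adjusted data. The function name, the simulation parameters, and the mean cost of 400 are all hypothetical.

```python
import math
import random


def variance_components(groups):
    """Method-of-moments (one-way ANOVA) variance components.

    groups: dict mapping a provider id to a list of log-scale outcomes.
    Returns (var_between, var_within).
    """
    sizes = {g: len(ys) for g, ys in groups.items()}
    means = {g: sum(ys) / len(ys) for g, ys in groups.items()}
    k = len(groups)
    total = sum(sizes.values())
    grand_mean = sum(sum(ys) for ys in groups.values()) / total

    ss_within = sum((y - means[g]) ** 2 for g, ys in groups.items() for y in ys)
    ss_between = sum(sizes[g] * (means[g] - grand_mean) ** 2 for g in groups)

    ms_within = ss_within / (total - k)
    ms_between = ss_between / (k - 1)

    # Effective panel size when providers have unequal panel sizes.
    n0 = (total - sum(n * n for n in sizes.values()) / total) / (k - 1)

    # Truncate at zero: the moment estimator can go negative by chance.
    var_between = max(0.0, (ms_between - ms_within) / n0)
    return var_between, ms_within


# Simulated data (illustrative assumptions only, not 3M results).
rng = random.Random(2017)
TRUE_SD_BETWEEN = 0.2   # spread of true provider effects, log scale
TRUE_SD_WITHIN = 1.0    # member-level noise, log scale
MEAN_LOG_COST = math.log(400.0)

groups = {}
for provider in range(200):
    provider_effect = rng.gauss(0.0, TRUE_SD_BETWEEN)
    panel_size = rng.randint(50, 300)
    groups[provider] = [
        MEAN_LOG_COST + provider_effect + rng.gauss(0.0, TRUE_SD_WITHIN)
        for _ in range(panel_size)
    ]

var_between, var_within = variance_components(groups)
print(f"between-provider variance: {var_between:.4f}")
print(f"within-provider variance:  {var_within:.4f}")
```

In practice a mixed-effects routine would estimate these components directly from the nested model; the sketch only shows where the two numbers that feed the ratio come from.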

Clinical Risk Groups (CRGs) are a 3M classification methodology that is useful in settings where clinical information is needed to bring further insight into the data investigation. The levels used in this model were physician, physician group, and ACO when applicable. A minimum of two levels was used. For instance, to find the reliability at the physician level, the random effects were physicians nested within physician group. To find it at the ACO level, we might nest physicians and physician groups within the ACO variable. Overall, the goal of this analysis was to test different person-count thresholds and see how the reliability captured the behavior of the variance for our outcomes of interest.

Or alternatively,

Where n is the number of people in that panel size. For instance, if we are looking at panels of between 200 and 300 persons, we may have 150 physicians with that range of patients. Our n is then 150. Again, reliability is a ratio measured between 0 and 1, and the common heuristic is that an average reliability of 0.7 or higher indicates that a measure is reliable.
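As a rough illustration of how the 0.7 heuristic might be applied by panel-size band, the Python sketch below averages per-provider reliability within each band. It assumes the form used in the Adams tutorial, in which a provider's error variance is the within-provider variance divided by that provider's panel size. The variance values, band labels, and panel sizes are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def reliability(var_between, var_within, panel_size):
    """Reliability of one provider's average, given its panel size.

    Assumed form: signal / (signal + noise), where the noise for a
    provider's average is var_within / panel_size.
    """
    return var_between / (var_between + var_within / panel_size)


def average_reliability(panel_sizes, var_between, var_within):
    """Mean reliability across the providers that fall in one band."""
    scores = [reliability(var_between, var_within, m) for m in panel_sizes]
    return sum(scores) / len(scores)


# Hypothetical variance components on the log-cost scale.
VAR_BETWEEN = 0.01
VAR_WITHIN = 1.0
THRESHOLD = 0.7  # common heuristic cited in the text

# Hypothetical bands: each list holds the panel sizes of the providers
# whose panels fall in that range.
bands = {
    "100-199": list(range(100, 200, 10)),
    "200-299": list(range(200, 300, 10)),
    "300-399": list(range(300, 400, 10)),
}

summary = {}
for label, sizes in bands.items():
    avg = average_reliability(sizes, VAR_BETWEEN, VAR_WITHIN)
    summary[label] = (len(sizes), avg, avg >= THRESHOLD)
    print(f"{label}: providers={len(sizes)}, "
          f"average reliability={avg:.3f}, meets threshold={avg >= THRESHOLD}")
```

With these made-up inputs, reliability rises with panel size because the noise term shrinks as more members are attributed to a provider. With real data the variance components themselves can differ by band, so the pattern need not be this tidy.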

Overall, the reliability estimate first developed in this manner by Adams et al. at RAND, later adapted by HealthPartners, and now by 3M HIS, demonstrates the flexibility needed from statistical tools. We adapted this tool in an attempt to identify patterns of variance for a univariate outcome that sits within a complex hierarchical backdrop. Stay tuned for further discussion and real-life examples of this useful tool in the healthcare data analyst’s toolbox. We will discuss selecting the appropriate panel sizes for estimating VBP inclusion and how we found that, for group size, bigger isn’t always better.

Michael Keyes is an engagement leader for Populations and Payment Solutions at 3M Health Information Systems.

Ryan Butterfield, DrPH, MBA, is a senior researcher and statistician at 3M Health Information Systems.

Special Thanks to: Erika Johnson, Chris Hensel, James Devine, Scott Clinton, Paul LaBrec, Melissa Gottschalk


  • Adams, John L., Ateev Mehrotra, J. William Thomas, and Elizabeth A. McGlynn. “Physician Cost Profiling — Reliability and Risk of Misclassification Technical Appendix.” New England Journal of Medicine 362.11 (2010): 13‐22. Print.
  • Adams, John L., Ateev Mehrotra, J. William Thomas, and Elizabeth A. McGlynn. “Physician Cost Profiling — Reliability and Risk of Misclassification.” New England Journal of Medicine 362.11 (2010): 1014‐1021. Print.
  • Adams, John L., Ph.D. “The Reliability of Provider Profiling, A Tutorial.” RAND Corporation (2009). Print.
  • “HealthPartners Total Cost of Care and Resource Use.” HealthPartners Total Cost of Care and Resource Use White Paper (2012): 1‐12. Web. 19 Aug. 2012.
  • “HealthPartners Total Cost of Care and Resource Use Reliability Metric Analysis.” HealthPartners Total Cost of Care and Resource Use White Paper (2012): 1‐12. Web. 19 Aug. 2012.
  • R Core Team (2016). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL
  • Crawley, M. J. (2013). The R book. 2nd Edition. Chichester, England: Wiley.