Objectives To explore agreement among healthcare professionals assessing eligibility for work disability benefits.
Design Systematic review and narrative synthesis of reproducibility studies.
Data sources Medline, Embase, and PsycINFO searched up to 16 March 2016, without language restrictions, and review of bibliographies of included studies.
Eligibility criteria Observational studies investigating reproducibility among healthcare professionals performing disability evaluations using a global rating of working capacity and reporting inter-rater reliability by a statistical measure or descriptively. Studies could be conducted in insurance settings, where decisions on ability to work include normative judgments based on legal considerations, or in research settings, where decisions on ability to work disregard normative considerations.
Review methods Teams of paired reviewers identified eligible studies, appraised their methodological quality and generalisability, and abstracted results with pretested forms. As heterogeneity of research designs and findings impeded a quantitative analysis, a descriptive synthesis stratified by setting (insurance or research) was performed.
Results From 4562 references, 101 full text articles were reviewed. Of these, 16 studies conducted in an insurance setting and seven in a research setting, performed in 12 countries, met the inclusion criteria. In the insurance setting, medical experts assessed actual disability claimants, claimants played by actors, hypothetical cases, or short written scenarios. Conditions were mental (n=6, 38%), musculoskeletal (n=4, 25%), or mixed (n=6, 38%). Applicability of findings from studies conducted in an insurance setting to real life evaluations ranged from generalisable (n=7, 44%) and probably generalisable (n=3, 19%) to probably not generalisable (n=6, 37%). Median inter-rater reliability among experts was 0.45 (range: intraclass correlation coefficient 0.86 to κ −0.10). Inter-rater reliability was poor in six studies (37%) and excellent in only two (13%). This contrasts with studies conducted in the research setting, where the median inter-rater reliability was 0.76 (range 0.91-0.53), and 71% (5/7) of studies achieved excellent inter-rater reliability. Reliability between assessing professionals was higher when the evaluation was guided by a standardised instrument (23 studies, P=0.006). No such association was detected for subjective or chronic health conditions or the studies’ generalisability to real world evaluation of disability (P=0.46, 0.45, and 0.65, respectively).
Conclusions Despite their common use and far reaching consequences for workers claiming disabling injury or illness, research on the reliability of medical evaluations of disability for work is limited and indicates high variation in judgments among assessing professionals. Standardising the evaluation process could improve reliability. Development and testing of instruments and structured approaches to improve reliability in evaluation of disability are urgently needed.
Many workers seek wage replacement benefits on the basis of disabling illness or injury, and over the past decade most countries of the Organisation for Economic Co-operation and Development (OECD) have experienced escalating rates of affected workers.12 Current estimates range from four to eight individuals per thousand per year,2 corresponding to 16 000 newly affected workers/year for smaller countries like Switzerland and 1 700 000/year for countries like the US.
Both public and private insurance systems provide wage replacement benefits for employees whose impaired health prevents them from working, as long as eligibility criteria are met.1 To inform this decision, insurers often arrange for evaluation of disability claims by medical professionals.345 Based on these evaluations, about half of all disability claims are declined.2
Equality before the law requires that claimants with similar health impairments and exposed to similar work demands should receive similar judgments of medical restrictions and limitations. Concerns have been raised, however, regarding low quality evaluations678 and poor reliability between medical experts.91011121314 Evaluation of disability is a complex process that is affected by the skillset, attitudes, and beliefs of the expert, and few countries enforce standards of practice,35 which presents considerable challenges to reliability (box 1).1516 We conducted a systematic review of reproducibility studies to summarise empirical evidence regarding the inter-rater reliability of global judgments on work disability and examined the hypothesis that studies using standardised assessments would show higher reliability.
Experts differ in their understanding of the demands of a certain job on the workers’ capacities and of the consequences of functional limitations on work performance
Experts differ in their personal value system on what level of effort, endurance, and discomfort can reasonably be expected by a claimant
Experts differ in their understanding of the legal requirements on a medical expertise that could affect their medical judgments
We followed the standards set by the Guidelines for Reporting Reliability and Agreement Studies (GRRAS)17 and Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA)18 for the reporting of our study.
We included reproducibility studies conducted in an insurance setting (evaluation of claimants) or in a research setting (evaluation of patients for work disability outside of actual assessments) in which two or more health professionals evaluated the work capacity of individuals claiming disability and reported inter-rater reliability on a global rating of work disability. Studies that reported only the inter-rater reliability of experts’ evaluation of specific physical or mental activities (such as lifting, conflict management) were excluded. All types of “subjects” qualified: real claimants, records of claimants, videotaped actors, vignettes, short case summaries.
We searched Medline, Embase, and PsycINFO from inception to 16 March 2016, without language restrictions. An experienced medical librarian (RC) developed database specific search strategies combining the following subject terms: reproducibility of results (MeSH, including reliability) and reliability statistics, disability or work capacity evaluation, and sick leave (see appendix 1 for the detailed search strategy). We screened the bibliographies of all included studies for additional relevant articles.
Three teams of paired reviewers (WdB, JWB, JH, SK, JS, RK) with expertise in medical evaluations and training in research methodology independently screened titles, abstracts, and full texts for eligibility, assessed generalisability, and collected data from each eligible study using standardised pilot tested forms with detailed instructions. Reviewers resolved disagreement through discussion or, if required, adjudication by a third reviewer (RK or WdB).
Quality appraisal of reproducibility studies includes methodological quality and generalisability to the setting in which the instrument will be used.171920 To address the former, we assessed the blinding of raters to each other’s findings, the risk of order effects, and appropriateness of the statistical analyses following Quality Appraisal for Reliability Studies (QAREL) guidance. To address generalisability, we evaluated whether claimants, raters, and the performance of the disability evaluation were similar to the insurance context in which such evaluations take place.1720
As reliability is a product of the interaction between the performance of the test, the subjects/objects, and the context of the assessment, and as its estimate is affected by various sources of variability in the measurement setting (that is, rater and subject characteristics, performance of the test;17 box 1), we used an explicit and transparent process to evaluate generalisability. Based on the checklist of QAREL,19 GRRAS,17 and expert guidance,20 we identified four claimant items and four expert items for defining greater generalisability:
The recruitment strategy captures diverse cases as would present in actual evaluation of disability (in declining order: random, consecutive, other recruitment; not applicable to written cases or videos)
Recruitment success (in declining order: >80%, 80-50%, <50%; not applicable to records of patients, videos, or written cases)
Verisimilitude—that is, the extent that cases reflect the population in real life (in declining order: real claimants, including videotapes/audiotapes of real claimants, records of real claimants, videos with actors, hypothetical patients, written cases)
Range of raters’ expertise in performing work disability evaluations (for example, wide range of experience that is comparable with the real world v narrow range of experience)
Medical experts with formal training in disability evaluation (for example, licensed disability raters, rehabilitation specialists) or without any specific training (no formal requirement, family physicians certifying sick leave), where experts without formal training (as is the case in most countries) more closely resemble real life
No specific training for study purposes
Number of cases that more closely resembles real life (in declining order: >100, 31-100, 11-30, 6-10, 1-5)
Number of raters that more closely resembles real life (in declining order: >16, 11-15, 6-10, 3-5, 2)
We gave more weight to studies with a broader spectrum and a larger number of experts to reflect the wide variation among medical experts in actual disability assessment, which tends to contribute substantially to the measurement error.
Five reviewers (JB, RK, WdB, JS, JanHo), blinded to the study results, assessed generalisability of each study, independently and in duplicate. Given the lack of empirical evidence about the relative importance of each item we used a sequential approach from medical decision making21 to make the weighting of each item explicit (see appendix 2 for detailed description). This approach facilitated judgments regarding overall generalisability (that is, “generalisable,” “probably generalisable,” “probably not generalisable,” and “not generalisable”). We calculated the reviewers’ concordance in generalisability ranking using Kendall’s W (coefficient of concordance), which generates values between zero (no agreement) and one (perfect agreement).
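The concordance statistic described above can be illustrated with a short sketch. It assumes each reviewer supplies a complete ranking without ties (the source does not state how ties were handled), and the function name `kendalls_w` and the example rankings are hypothetical, not the review's data.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance for m raters ranking n items.

    `rankings` is a list of m lists; each inner list holds the ranks
    1..n that one rater assigned to the same n items (no ties assumed).
    Returns a value between 0 (no agreement) and 1 (perfect agreement).
    """
    m = len(rankings)
    if m < 2:
        raise ValueError("need at least two raters")
    n = len(rankings[0])
    if n < 2:
        raise ValueError("need at least two items")
    expected = list(range(1, n + 1))
    if any(sorted(r) != expected for r in rankings):
        raise ValueError("each rater must supply a tie-free ranking 1..n")

    # Sum of ranks received by each item across all raters.
    totals = [sum(r[i] for r in rankings) for i in range(n)]
    mean_total = m * (n + 1) / 2
    # Sum of squared deviations of the rank totals from their mean.
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical example: three reviewers rank four studies by generalisability.
example = [[1, 2, 3, 4], [1, 2, 3, 4], [2, 1, 3, 4]]
print(round(kendalls_w(example), 3))
```

Identical rankings give a value of one, and two exactly reversed rankings give zero, which matches the interpretation of the bounds given in the text.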
We limited assessment of generalisability to studies performed in an insurance setting because studies conducted in a research setting, by definition (“normative or legal considerations not part of the judgment”, see data analysis), lack generalisability to real life assessments of disability.
We extracted the following information from each eligible study:
Study context: background and setting (insurance, rehabilitation, research)
Patients’ characteristics (“cases”): number of cases per study; presenting disorder(s) (mental disorder, musculoskeletal disease, mixed); course of disease or injury (acute, chronic)
Expert characteristics (“raters”): number of raters per study; number of cases per rater; number of raters per case; profession (primary or secondary care, occupational physician, insurance physician)
Procedures: time frame before the evaluation for judging current health status and work disability; time frame for predicting global work disability (for both time frames, short term refers to less than six months; long term refers to more than six months; mixed); instrument (professional expertise with or without specific rating instrument) to support global rating of work disability and the related categories (for example, fully limited, partially limited, no limitations) or scales (for example, scale 0-100)
Outcomes: global rating of work disability (for example, work capacity, sick leave, readiness for return to work, reduction in working hours, decisions on suitability for a specific job, occupational functioning); measure of reliability or agreement (intraclass correlation coefficient (ICC), κ statistic, or percentage agreement), including measure of precision, or descriptive measure (for example, frequency of judgments).
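Two of the agreement measures listed above, percentage agreement and the κ statistic, can be contrasted in a minimal sketch for the simplest case of two raters and unweighted categories (Cohen's κ). The function names, the category labels, and the ratings are hypothetical and serve only to show how chance correction changes the picture.

```python
from collections import Counter


def percentage_agreement(rater_a, rater_b):
    """Proportion of cases on which two raters gave the same category."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty rating lists of equal length")
    return sum(x == y for x, y in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa: observed agreement corrected for chance."""
    n = len(rater_a)
    p_observed = percentage_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance if the raters judged independently
    # while keeping their own marginal frequencies.
    p_expected = sum(
        counts_a[c] * counts_b[c] for c in counts_a.keys() | counts_b.keys()
    ) / n ** 2
    if p_expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical ratings of ten claimants by two experts.
expert_1 = ["able"] * 8 + ["unable"] * 2
expert_2 = ["able"] * 7 + ["unable", "able", "unable"]
print(percentage_agreement(expert_1, expert_2))
print(round(cohens_kappa(expert_1, expert_2), 3))
```

In this invented example the experts agree on eight of ten claimants, yet κ is much lower because both rate most claimants as able to work, so a large share of the agreement is expected by chance alone.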
We distinguished between studies conducted in an insurance setting or a research setting. In an insurance setting, health professionals make judgments on disability for work that are based on functional limitations and include normative judgments from a societal perspective. An insurance setting does not imply any specific format of the claimant’s presentation in the study, which can range from a real patient to a written case (see also “generalisability, verisimilitude”). Researchers in a research setting who develop and/or validate instruments tend to standardise their research environment when judging occupational functioning. Normative (legal) considerations or a societal perspective are not part of their judgments.
We used studies conducted in a research setting to investigate the association between level of standardisation in the evaluation process and inter-rater reproducibility. Level of standardisation was considered as “not standardised” when medical experts in the insurance setting used only their professional expertise to elicit information and rate findings from the claimant; as “semi-standardised” when they used a structured instrument as one component of the evaluation; and as “fully standardised” when occupational functioning was primarily evaluated with a structured instrument.
Lack of information on variation associated with reproducibility statistics and heterogeneity of statistical measures and outcomes precluded pooling of the data across studies. Using a two tailed Fisher’s exact test, we explored whether objective (versus subjective) and acute (versus chronic) health conditions as well as higher levels of generalisability and/or higher levels of standardisation in the evaluation process were associated with a higher inter-rater reproducibility. We classified mental disorders as “subjective complaints” and somatic disorders as “objective complaints,” though we acknowledge the crude nature of this classification. We classified conditions lasting less than six months as acute and those lasting more than six months as chronic. We excluded from the chronicity analysis three studies that did not specify the chronicity. Fisher’s exact test yields an exact P value directly rather than a test statistic, so we report P values only.
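To make the test concrete, the following sketch computes a two tailed Fisher’s exact P value for a 2×2 table from the hypergeometric distribution. It is an illustration under stated assumptions rather than the analysis performed in the review: the source does not report the contingency tables, the review’s comparisons may have used more than two categories per variable, and the function name and the counts in the usage example are invented.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two tailed Fisher's exact P value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    if min(a, b, c, d) < 0:
        raise ValueError("cell counts must be non-negative")
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x, given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance guards against floating point ties.
    total = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )
    return min(1.0, total)


# Invented counts, NOT the review's data:
# rows = instrument used (yes / no), columns = reliability (good or excellent / fair or poor).
print(fisher_exact_two_sided(9, 2, 3, 9))
```

As the function shows, the procedure returns a P value without any intermediate test statistic, which is why only P values appear in the results.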
For clinical interpretation of reliability measures, we used the thresholds established by Fleiss in 1981 [25] to distinguish between poor, fair, good, and excellent inter-rater reliability [26-28]. For κ, weighted κ, and intraclass correlation, the cut-off levels were <0.40 (poor), 0.40-0.59 (fair), 0.60-0.74 (good), and 0.75-1.00 (excellent); for percentage agreement, the levels were <70% (poor), 70-79% (fair), 80-89% (good), and 90-100% (excellent), bearing in mind that percentage agreement does not correct for agreement between raters that occurs by chance. Biometricians acknowledge that these guidelines are broadly accurate with some arbitrariness. Though at times they might come up with conflicting results, they have proved valuable in clinical application [28].
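The cut-off levels above translate directly into a small lookup, sketched here. The function names are hypothetical, and treating each band as starting exactly at its lower bound (so that, for example, 0.595 counts as fair) is an assumption, because the source lists the bands to two decimal places only.

```python
def classify_coefficient(value):
    """Classify kappa, weighted kappa, or ICC using the cut-offs in the text."""
    if not -1.0 <= value <= 1.0:
        raise ValueError("coefficient must lie between -1 and 1")
    if value < 0.40:
        return "poor"
    if value < 0.60:
        return "fair"
    if value < 0.75:
        return "good"
    return "excellent"


def classify_percentage_agreement(percent):
    """Classify percentage agreement (0-100) using the cut-offs in the text."""
    if not 0 <= percent <= 100:
        raise ValueError("percentage must lie between 0 and 100")
    if percent < 70:
        return "poor"
    if percent < 80:
        return "fair"
    if percent < 90:
        return "good"
    return "excellent"


# Values reported in this review.
print(classify_coefficient(0.45))           # median, insurance setting
print(classify_coefficient(0.76))           # median, research setting
print(classify_percentage_agreement(82.4))  # return-to-work recommendations
```

Applied to the medians reported in this review, the lookup places the insurance setting in the fair band and the research setting in the excellent band.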
No patients were involved in setting the research question, in developing plans for design, interpretation, reporting or implementation of the study. We plan to disseminate the results of this study to organisations supporting patients with disabilities.
Of 4562 potentially relevant citations identified, 101 reports proved potentially eligible after we had screened titles and abstracts. On full text screening, 23 studies,911222324293031323334353637383940414243444546 including four non-English studies,9394041 proved eligible for analysis (fig 1). All studies were published from 1992 onwards and enrolled disability claimants from 12 countries in Europe, North America, Australia, the Middle East, and northeast Asia. Seven studies (30%) were conducted in the Netherlands. Seventy percent of the studies (16/23) were conducted in an insurance setting, with the remainder in a research setting. Investigators used a broad spectrum of designs to perform reliability studies, ranging from real life disability evaluations, videotapes with actors, and records of claimants to 10 line case vignettes. Study size varied considerably, with the number of raters ranging from two to 103 and the number of patients from one to 3562 per study (tables 1 and 2).
Assessment of methodological quality included blinding of raters to each other’s findings, presence of order effects, and appropriateness of the statistical analyses (table 3; appendix 2). The studies on the reproducibility between medical experts conducted in an insurance setting met 80% (31/39) of these items; 15% (6/39) remained unclear, and 5% (2/39) were not applicable. The methodological quality items did not fit the design of the studies that looked at the reproducibility between medical experts and health professionals. Studies conducted in a research setting met 52% (11/21) of the quality items; 33% (7/21) remained unclear and 14% (3/21) were not met (table 3).
With regard to generalisability of the findings to real life disability evaluation, 44% (7/16) of studies in the insurance setting were rated as “generalisable,” 19% (3/16) as “probably generalisable,” and 37% (6/16) as “probably not generalisable” (table 4). Kendall’s W for reviewers’ concordance in ranking generalisability was 0.93, with a rank correlation of 0.89, confirming high agreement among the raters’ rankings.
In the insurance setting, 13 studies including 463 patients and 367 raters explored agreement between medical experts (two or more experts assessing the same patient) (table 1; appendix 4).9112223243234373943444546 Three studies including 3729 patients (with 3562 patients from a single centre33) and eight raters (information was lacking from one study33) explored agreement between medical experts and claimants’ treating physicians33 or independent rehabilitation or occupational health teams with a mandate to care.3842 The median number of patients per study was 13.5 (range 1-3562), and the median number of raters per study was 12 (2-103, excluding one study that did not report the number of raters33). All but three studies24342 used a fully crossed design (that is, all raters evaluated all patients), with a median of 11 patients (range 1-180) per rater and a median of 11.5 raters (2-103) per patient.
Table 5 summarises claimants’ characteristics. Studies focused on mental health (n=6), musculoskeletal disease (n=4), and mixed disorders (n=6). They enrolled patients with chronic diseases (n=11), chronic injuries (n=2), or mixed acute and chronic conditions (n=3). Most referred to a long term time frame before the evaluation for judging health status and work disability and predicted a long term perspective exceeding six months. Most studies used professional expertise only to generate a global rating of work ability (n=10). Six administered one or more specific rating instruments; five were referenced (appendix 3), and none was reported as validated.
Work disability outcomes varied considerably between studies and included a broad spectrum of domains, definitions, and measurement approaches, ranging from work ability to the employee’s readiness and ability to return to work, the degree of disability or handicap, or reduction in working hours. Measurement approaches included scales, scores, and categories (table 6).
Studies conducted in a research setting included 371 patients and 32 raters (table 2; appendix 4). Four studies reported on instrument development,29353641 and three studies validated existing instruments.303140 The median number of patients per study was 39 (range 20-180), and the median number of raters per study was three (2-18). All but two studies2940 used a fully crossed design, with a median of 21 patients (11-42) per rater and a median of two raters (2-4) per patient.
All studies were conducted with actual patients and focused on acute and chronic mental health conditions. Most used a short term time frame before the evaluation for judging occupational functioning, two provided a short term prognostic judgment on occupational functioning, and this information was missing in five studies (table 2). All seven studies used instruments of varying complexity to elicit or to report capacities or limitations to determine a global rating for occupational functioning (appendix 4). All studies generated global ratings on a range of outcomes for occupational functioning, such as “occupational functioning” or “remunerative employment” (table 7).
Overall, across all conditions and outcomes, the median inter-rater reliability was 0.45, ranging from an ICC of 0.86 (musculoskeletal disorders; reduction in working hours [22]) to a κ of −0.10 (narcolepsy; disability benefit [11]) (table 8). Six studies reported excellent or good inter-rater reliability for a global rating of work disability, with ICCs of 0.64 [46] and 0.65 [44], percentage agreement of 82.4% (“return-to-work” recommendations [37]), or κ of 0.80 [23] and 0.86 [22] for reduction in working hours. One study presented mixed judgments in a single case, which we considered overall as “good agreement” based on the relative importance of the outcomes of functional ability to work (91.2% agreement on remaining work ability) and for work recommendations (86% agreement on limitations in work performance) over the outcome of readiness and ability to return to work (56% agreement on reduction in working hours) [39]. All Dutch studies used one or more rating instruments for determining functional limitations [22, 23, 44, 46]. Two studies qualified as “generalisable” [22, 23], two as “probably generalisable” [37, 39], and two as “probably not generalisable” [44, 46].
Seven studies reported fair or poor inter-rater reliability across all global ratings of work disability outcomes. All but one24 based their judgments exclusively on professional expertise. One study presented discordant judgment on a single case9 (one third of experts each rated “full,” “partial,” or “no work ability” for the same patient). Three studies qualified as “generalisable” and four as “probably not generalisable.”
Overall, across conditions and outcomes, percentage agreement ranged from 51% (work ability in last job)33 to 4% (somatic occupational disorders; four disability items)38 (table 8). Three studies compared reproducibility of ratings on work disability between experts and health professionals with a mandate to care.333842 One study reported poor agreement between experts and the claimants’ treating physicians.33 Another study reported highly discordant judgments on disability between medical experts and health professionals of an occupational health centre.38 The third study found poor agreement between the decisions of the social security administration and those of an independent rehabilitation team.42
The direction of disagreement was mixed. Medical experts approved higher levels of work ability for claimants33 or their recommendations and decisions favoured the insurer,38 while in the third study, the rehabilitation team was more reluctant to grant disability benefits to patients with mental disorders than the social security administration.42 All studies based their judgments exclusively on professional expertise. Two studies qualified as “generalisable,”3338 one as “probably generalisable.”42
Overall, across conditions and outcomes, the median inter-rater reliability was 0.76, ranging from an ICC of 0.91 (anxiety and mood disorders; occupational functioning35) to κ of 0.53 (mixed mental disorders; occupational functioning31).
Five of seven studies (71%) reported excellent (global) inter-rater reliability on work disability judgments, with ICCs ranging from 0.75 [40] to 0.91 [35]. The remaining two studies [30, 31] reported agreement on single items: good agreement (κ 0.62) regarding the ability to engage in remunerative employment [30] and fair agreement (κ 0.53) for difficulties encountered in day-to-day work (occupational functioning) [31].
Testing the relation between inter-rater reliability and subjective (versus objective) and chronic (versus acute) health conditions as well as the studies’ overall generalisability did not show any association (subjectivity, 23 studies, P=0.46; chronicity, 20 studies, P=0.45; generalisability, 16 studies, P=0.65). Testing the relation between the level of standardisation and inter-rater reliability in all 23 studies showed a highly significant association (P=0.006).
Current evidence regarding reliability of disability evaluation is limited and shows highly variable agreement between medical experts. Higher agreement seems to be associated with the use of a standardised approach to guide judgment and studies in a research (manufactured) setting.
Strengths of our study include broad inclusion criteria to define eligibility and inclusion of publications in any language, which increases confidence that we captured all studies eligible for our review. Our outcome—global rating of disability for work—is highly relevant to the practice of medical experts, disability insurers, and employers, which increases the practical implications of our findings. Further, we evaluated the generalisability of evidence by following international guidance for evaluating reliability studies171920 and by using an explicit approach in eliciting reviewers’ judgments on the relative weights of the generalisability items. While the high agreement we found among reviewers strengthens the credibility of the results, this approach requires further validation. Some cut offs of the generalisability criteria (such as number of raters) are context specific and might not be applicable to settings other than assessment of disability. Furthermore, variability of study designs, measures of agreement, and outcomes precluded statistical pooling across studies.
Disability evaluation is a poorly understood process141516 that lacks any reference standard to confirm the validity of the findings. Health professionals who perform this task assess medical restrictions and limitations of claimants and are often asked to infer consequences on the ability to work. This, however, requires expertise in vocational rehabilitation, as medical restrictions do not correlate well with function and the ability to work.5 In such situations, reliability studies evaluate the measurement properties of observers.47 At each step of disability evaluation, multiple sources of variation come into play (box 1),1516 including experts’ personal attitudes, beliefs, and values towards disability, all of which affect the global judgment of work disability. Left unmanaged, these sources of variation can lead to low inter-rater reliability.
We found higher agreement when disability evaluation was guided by a standardised instrument. Instruments that standardise the collection, interpretation, and reporting of information are one promising approach to reduce variation.15 Five of the seven Dutch studies used instruments to guide assessment of work disability, and all five achieved fair to good reliability. As all Dutch insurance physicians undergo four years of specialty training in insurance medicine,48 however, we cannot disentangle whether higher agreement is a result of use of a formal instrument or calibration by training, or both.
We did not detect any association between inter-rater reliability and subjectivity or chronicity of the health conditions, or overall generalisability to real world disability evaluation. The low number of studies in the analyses, however, precludes any conclusion that such associations do not exist.
Not all sources of variation are easily amenable to change. Other sources, in particular attitudes, beliefs, and value judgments, will require other approaches.49 Implicit in the use of evaluations of disability by a third party is the concern that treating clinicians could have difficulty providing impartial assessments of their patients. Indeed, our findings suggest that medical experts (versus treating physicians) are more likely to conclude that claimants are capable of working. Claimant lawyers and patients’ organisations have raised concerns that experts who are paid to assess claimants for insurers might feel pressure to render opinions that favour the referral source.
Our review suggests that use of standardised instruments could improve reliability in expert judgments on work disability. Appropriate instruments should therefore be considered in routine practice of disability evaluations (see table 8 and appendix 3 for examples). To ensure appropriate administration and interpretation of the findings, experts will need appropriate training and calibration on the use of such instruments. As most instruments reported in this review are available only in Dutch, other countries would need to develop their own instruments or translate instruments and accompanying manuals into national languages.
As few countries have standards to guide assessments, standardised instruments that improve reliability could become a target for change and parties ordering assessments should demand their use.
Given the widespread use of evaluation of disability for work to determine claimants’ eligibility for wage replacement benefits, our findings suggest that further research to improve reliability is urgently needed. Promising targets include formal training in evaluation of capacity to work,50 use of standardised instruments to guide disability evaluations,50 and addressing the conflict of interest that arises when insurers (or lawyers) select their own experts. Further, there might be greater need for strategies to improve agreement when patients present with subjective complaints. Ikezawa and colleagues found that different medical experts were able to agree on claimants’ ability to return to work in 97% of claims involving a fracture and 94% of claims involving a dislocation, but in only 56% of claims involving chronic low back pain.37 Our review further suggests that interventions should be validated in real insurance settings, as experimental settings could artificially inflate agreement.
Improved knowledge of individual factors that contribute to variability in evaluation of capacity to work is also needed. Promising targets could provide a starting point to develop and test focused strategies to reduce variability (for example, appropriate assessment tools, guidelines, standard cases). Guidance is also required on the level of inter-rater reliability needed to ensure equal treatment of claimants. Any decision on what constitutes an appropriate threshold, which might be similar to thresholds for clinical medical tests,2728 will require societal discussion on what constitutes acceptable differences in the treatment of claimants, or alignment with standards set by professional organisations of psychology or education. To make evaluations on work disability fair and meaningful and thereby qualify for decisions on claimants’ disability benefits, however, we suggest a minimum intraclass correlation coefficient of 0.6 (the cut off between fair and good inter-rater reliability), with a sufficiently narrow 95% confidence interval (0.5 to 0.7) to exclude poor reliability.
Despite their widespread use, medical evaluations of work disability show high variability and often low reliability. Use of standardised and validated instruments to guide the process could improve reliability. There is an urgent need for high quality research, conducted in actual insurance settings, to explore promising strategies to improve agreement in evaluation of capacity to work.
Social and private disability insurers use medical experts to evaluate claimants with impaired health to determine eligibility for disability benefits
Anecdotal evidence suggests that experts often disagree in their judgment of capacity to work when assessing the same claimant
This systematic review of 23 reproducibility studies from 12 countries shows a lack of good quality data applicable to the real world of disability assessment
In most studies, medical experts reached only low to moderate reproducibility in their judgment of capacity to work
Studies reported higher reproducibility when experts used a standardised evaluation procedure
These findings are disconcerting and call for substantial investment in research to improve assessment of disability