What is the value of musculoskeletal ultrasound in patients presenting with arthralgia to predict inflammatory arthritis development? A systematic literature review

Objective: Musculoskeletal ultrasound (US) is frequently used in several rheumatology practices to detect subclinical inflammation in patients with joint symptoms suspected of progressing to inflammatory arthritis. To evaluate the scientific basis for this specific use of US, we performed a systematic literature review to determine whether US features of inflammation are predictive of arthritis development and which US features are of additive value to other, regularly used biomarkers.
Methods: Medical literature databases were systematically searched up to May 2017 for longitudinal studies reporting on the association between greyscale (GSUS) and Power Doppler (PDUS) abnormalities and inflammatory arthritis development in arthralgia patients. Quality of studies was assessed by two independent reviewers using a set of 18 criteria. Studies were marked high quality if they scored ≥ 80.6% (the median score). Best-evidence synthesis was performed to determine the level of evidence (LoE). Positive and negative likelihood ratios (LR+, LR−) were determined.
Results: Of 3061 unique references, six fulfilled inclusion criteria (three rated high quality), of which two reported on the same cohort. Heterogeneity in arthralgia populations, US machines, and scoring systems hampered the comparability of results. LoE for GSUS as predictor was limited and for PDUS moderate; LoE for the additive value of GSUS and PDUS to other biomarkers was limited to moderate. Estimated LR+ values were mostly < 4 and LR− values > 0.5.
Conclusions: Data on the value of GSUS and PDUS abnormalities for predicting inflammatory arthritis development are sparse. Although a potential benefit is not excluded, current LoE is limited to moderate. Future studies are required, preferably performed in clearly defined, well-described arthralgia populations, using standardized US acquisition protocols and scoring systems.
Electronic supplementary material The online version of this article (10.1186/s13075-018-1715-8) contains supplementary material, which is available to authorized users.


Background
The development of rheumatoid arthritis (RA) is thought to consist of several stages: a) genetic risk factors for RA; b) environmental risk factors for RA; c) systemic autoimmunity associated with RA; d) symptoms without clinical arthritis; e) unclassified arthritis (UA); f) RA [1]. The phase of arthralgia preceding clinical arthritis (phase d) is of particular interest, since it is hypothesized that disease-modifying treatment initiated in this phase might result in better disease outcomes than treatment initiated in the phases of UA and RA [2]. However, musculoskeletal symptoms such as arthralgia are prevalent, and arthralgia is frequently not related to imminent RA. To identify arthralgia patients at risk for RA, different strategies can be used, such as selecting arthralgia patients based on clinical features associated with RA development, using autoantibody tests or imaging to detect subclinical inflammation, or a combination of these.
Musculoskeletal ultrasound (US) is a frequently used imaging modality as it is fast, easy to apply, and readily accessible. Although US is frequently used in patients presenting with arthralgia (as also proposed in an algorithm for the pragmatic use of US [3]) in several rheumatology practices, we questioned what the scientific basis is to use US as a predictor for future inflammatory arthritis development. Therefore, we systematically studied the literature to determine if US features of inflammation are predictive for inflammatory arthritis development and, if so, to determine which US features are of additive value to other regularly used biomarkers, with the ultimate goal of obtaining evidence-based information on the value of US in patients presenting with arthralgia.

Systematic literature search
The PRISMA guidelines were followed [4]. Search strategies were built in collaboration with an experienced librarian (WB) and executed in electronic medical literature databases (Embase.com, Medline Ovid, Web of Science, Scopus, Cochrane Central, Google Scholar) up to 11 May 2017 (complete searches in Additional file 1: File S1). Reference lists of the included papers were checked for additional papers, and unpublished and ongoing trials were identified using the World Health Organization (WHO) International Clinical Trials Registry Platform (ICTRP) search portal (http://apps.who.int/trialsearch/) and ClinicalTrials.gov (http://clinicaltrials.gov).

Selection of studies based on inclusion and exclusion criteria
Two reviewers (SO, RvdB) assessed each title for suitability for inclusion in this review, according to predetermined inclusion and exclusion criteria. Next, abstracts were retrieved for detailed review and, finally, full-text papers were assessed if further information was required. Papers not addressing the topic of interest were excluded and reasons for exclusion recorded.
From the total number of studies identified by the database search, studies were included if the following inclusion criteria were met: 1) investigation of subjects without clinical arthritis who had arthralgia, regardless of rheumatoid factor (RF) and anti-citrullinated protein antibody (ACPA) status, or of ACPA+ subjects with musculoskeletal symptoms; 2) investigation of small joints of the hands and/or feet using US; 3) joints and/or tendons were assessed for inflammatory features (GS synovial hypertrophy and/or PDUS); 4) subjects were followed prospectively; 5) development of (persistent) inflammatory arthritis or RA was defined as outcome. Studies about other inflammatory joint conditions, animal studies, reviews, letters to the editor, case reports, case series, commentaries, guidelines, editorials, abstracts, studies in populations < 18 years of age, and studies in languages other than English, Dutch, and German were excluded.

Data extraction
The two reviewers independently assessed the full texts of the included studies using a predefined sheet to extract data about: 1) study population (number of patients, age, gender, symptom duration); 2) follow-up period; 3) musculoskeletal US equipment (producer, transducer, machine setting, mode (GSUS/PDUS)); 4) US acquisition (number and type of examined joints, examined pathology, scoring method, any cut-off used); 5) longitudinal outcome.
Data from univariable analyses were extracted to answer the first aim; data from multivariable analyses were extracted to answer the second aim on added value.

Quality assessment and analyses
Due to heterogeneity of the studies, it was not possible to perform meta-analyses and calculate pooled effect estimates. Therefore, we performed a best-evidence synthesis based on the guidelines on systematic review of the Cochrane Collaboration Back and Neck (CBN) Group [5], a method summarizing the level of evidence (LoE) in observational studies if study population, outcomes, and data analyses are heterogeneous (Additional file 1: Table S1). LoE is based on the presence of statistical significance, which depends on sample sizes, taking into account the quality of the studies. Quality of the studies was evaluated by the two reviewers individually, using a set of 18 criteria based on previous systematic reviews of prognostic factors in the field of musculoskeletal disorders [2,6]. This list included seven criteria specifically for the use of US, of which three were considered mandatory (Additional file 1: Table S2). A study was considered high quality if all three mandatory criteria were fulfilled and the total score was ≥ 80.6% (median of quality scores obtained in this review).
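The classification rule described above can be sketched in Python. This is an illustration only: the per-study point totals below are hypothetical placeholders (not the published per-study scores), chosen so that their median happens to equal the 80.6% threshold, and the function names are ours.

```python
from statistics import median

N_CRITERIA = 18  # size of the quality checklist described in the text


def quality_score(points: float) -> float:
    """Quality score as a percentage of the 18 checklist criteria."""
    return 100.0 * points / N_CRITERIA


def classify_quality(points: float, mandatory_met: bool, threshold: float) -> str:
    """High quality = all mandatory criteria fulfilled AND score >= threshold."""
    if mandatory_met and quality_score(points) >= threshold:
        return "high"
    return "low"


# Hypothetical point totals for six studies (assumption, for illustration only).
hypothetical_points = [11, 13, 14, 15, 15, 15]

# The threshold is the median of the scores obtained in the review itself.
threshold = median(quality_score(p) for p in hypothetical_points)

# All six studies are assumed to fulfil the three mandatory criteria.
labels = [classify_quality(p, True, threshold) for p in hypothetical_points]
print(f"threshold = {threshold:.1f}%", labels)
```

Because the threshold is the median of the observed scores, roughly half of the included studies will be labelled high quality by construction, whatever their absolute quality.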
Positive and negative likelihood ratios (LR+ and LR−, respectively) and positive and negative predictive values (PPV and NPV, respectively) were calculated from the presented outcome data (using the presented follow-up duration (Table 1)) to evaluate the predictive accuracy. Again because of heterogeneity, no summary estimates were calculated.
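For readers less familiar with these measures, the following Python sketch shows how they follow from a 2×2 table of US result versus arthritis development. The counts are hypothetical and do not come from any included study; the function name is ours.

```python
def diagnostic_accuracy(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy measures from a 2x2 table.

    tp: US positive, developed arthritis    fp: US positive, no arthritis
    fn: US negative, developed arthritis    tn: US negative, no arthritis
    Note: LR+ is undefined when specificity is 1 (division by zero).
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "LR+": sensitivity / (1 - specificity),
        "LR-": (1 - sensitivity) / specificity,
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
    }


# Hypothetical counts (assumption, for illustration only).
result = diagnostic_accuracy(tp=20, fp=30, fn=10, tn=90)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Likelihood ratios are built from sensitivity and specificity only, whereas PPV and NPV also depend on how many patients in the cohort reach the outcome.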

Selection and inclusion of articles
In total, 5028 titles were identified and, after removing duplicates, 3061 unique references were screened (Additional file 1: Figure S1). After detailed review, six full-text papers fulfilled the inclusion and exclusion criteria (Table 1) [7][8][9][10][11][12], of which two studies reported on the same cohort [10,11]. One of them reports on dichotomous PDUS results only and the other presents PDUS and GS synovial hypertrophy results for various cut-offs.

Quality assessment
The two reviewers rated 108 items and agreed on 98 (91.6%); disagreement on items was solved by discussion (Additional file 1: Table S3). All six included studies fulfilled the three mandatory criteria. Median quality score was 80.6% (range 61.1-83.3%). Two of the three high-quality papers described the same cohort [8,10,11].

Study characteristics
The number of included patients varied between 80 and 379; the majority were female (69-83%) aged > 50 years. None of the studies had stringent inclusion criteria with respect to symptom constitution. The cohort described in the papers by Nam et al. [10] and Rakieh et al. [11] included ACPA+ patients with new-onset musculoskeletal symptoms from primary care physician clinics and the rheumatology early arthritis clinic in Leeds. In the study of Van der Ven et al. [8], patients with inflammatory joint complaints involving at least two joints in the hands, feet, or shoulders for < 1 year which could not be explained by other conditions were included if they also met at least two of the following criteria: morning stiffness for > 1 h, unable to clench a fist in the morning, pain when shaking someone's hand, pins and needles in the fingers, difficulties wearing rings or shoes, family history of RA, and/or unexplained fatigue. In the paper by Zufferey et al. [7], ACPA- and RF-negative patients with polyarthralgia for > 6 weeks with an inflammatory or mixed (mechanical and inflammatory) character referred by their general practitioner or rheumatologist were included. Van de Stadt et al. [12] recruited ACPA+ and/or RF+ patients with arthralgia, defined as "non-traumatic pain in any joint", at rheumatology clinics in Amsterdam after referral by their general practitioner. Patients presenting with new-onset arthralgia to the Newcastle Early Arthritis Clinic were included in the study by Pratt et al. [9], but no description of arthralgia was provided.
Two studies reported on inter-observer reliability, which was moderate (kappa = 0.56 for GS synovial hypertrophy) to substantial (kappa = 0.64 for PDUS) [9] in one study, and fair (kappa = 0.22 for effusion) to moderate (kappa = 0.47 for synovitis) and substantial (kappa = 0.67 for PDUS) in another study [12], yet good in terms of overall percentage agreement (88-92%).
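The contrast between a modest kappa and a high percentage agreement can be illustrated with a short Python sketch. The verbal labels used above are consistent with the commonly used Landis and Koch bands; that the included studies applied exactly these bands is our assumption, and the 2×2 counts are hypothetical.

```python
def cohen_kappa(a: int, b: int, c: int, d: int) -> float:
    """Cohen's kappa for two raters and a binary finding.

    a: both raters positive          b: rater 1 positive, rater 2 negative
    c: rater 1 negative, rater 2 positive          d: both raters negative
    """
    n = a + b + c + d
    p_observed = (a + d) / n
    # Agreement expected by chance from each rater's marginal frequencies.
    p_expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return (p_observed - p_expected) / (1 - p_expected)


def landis_koch(kappa: float) -> str:
    """Verbal label for kappa (Landis and Koch bands, assumed here)."""
    if kappa < 0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"), (0.80, "substantial")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"


# Hypothetical counts: a rare finding on which raters mostly agree it is absent.
a, b, c, d = 2, 4, 4, 90
kappa = cohen_kappa(a, b, c, d)
agreement = (a + d) / (a + b + c + d)
print(f"agreement = {agreement:.0%}, kappa = {kappa:.2f} ({landis_koch(kappa)})")
```

When a finding is rare, two raters agree on most joints simply because both score them as normal, so percentage agreement stays high while kappa, which corrects for chance agreement, can remain low.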

Outcome
Outcome was defined as RA (ACR/EULAR 2010 criteria [21]) in one study and (persistent) (inflammatory) arthritis in the remaining five. Outcome was reached in 8.8-50.0% of patients; frequency was lowest in ACPA-/RF-negative populations and highest in ACPA+/RF+ populations. Duration until outcome was reached varied between 7.9 and 18.3 months and was not specified in two studies (Table 1).

LoE of GSUS and PDUS abnormalities as predictor for arthritis development
The prevalence of different US features varied per patient group and cut-off used. For GS synovial hypertrophy it ranged from 11.6% (GSUS ≥ 2 in patients without arthritis development) to 77.2% (GSUS ≥ 2 in patients that developed arthritis); for PDUS from 6.3% (PDUS = 2 in patients without arthritis development) to 44.0% (PDUS ≥ 1 in patients that developed arthritis) (Table 2). The prevalence of tenosynovitis ranged from 6.1% (GSUS ≥ 2 in patients without arthritis development) to 8.9% (GSUS ≥ 2 in patients with arthritis development).

*Tender joints at physical examination were scanned; otherwise joints that were painful by history were scanned. For MCP, PIP, and MTP joints, the directly adjacent joints in the same joint group as the painful joints were scanned.

Tenosynovitis
One low-quality study evaluated tenosynovitis and found no statistically significant association with arthritis development (OR 1.50 [95% CI 0.44-5.11]) [12]. Hence, LoE with regard to the predictive value of tenosynovitis is insufficient.

LoE of GSUS and PDUS abnormalities being additive to other biomarkers
Three studies investigated the association of GS synovial hypertrophy with arthritis development, correcting for different biomarkers (Table 1) [...] [7,9]. One high-quality study reported a statistically significant association of a "positive US" (GSUS ≥ 2 and/or PDUS ≥ 1; OR 2.65 [95% CI 1.44-4.88]) [8]. Hence, LoE with regard to the question of whether GS synovial hypertrophy may have value in predicting arthritis development, additive to regularly assessed biomarkers, is moderate. Likewise, two studies performed multivariable analysis with PDUS. After correction for (different) biomarkers (Table 1), one high-quality study reported a statistically significant association (OR 3.44 [95% CI 1.71-6.95]) [8].
The other high-quality study reported a non-significant association (HR 1.51 [95% CI 0.83-2.74]) [11]. Hence, LoE of the value of PDUS in addition to other biomarkers is limited.
The value of tenosynovitis (GS/PD) in addition to other biomarkers was not investigated.
Predictive values depend on disease prevalence: at a given sensitivity and specificity, PPV rises and NPV falls as prevalence increases. Percentages of patients that developed arthritis varied between 8.8 and 50%; thus, prior risks of not progressing were 50-91.2%. We calculated the increase in the absolute risk of inflammatory arthritis provided by US-detected abnormalities by comparing PPV and NPV with prior risks (Additional file 1: Table S4). Overall, PPVs were low or moderate (23.5-71.9% for GS synovial hypertrophy; 30.3-75% for PDUS) and the increase in absolute risk in US-positive patients ranged from 5.8 to 29.2% (GS synovial hypertrophy) and from 6.9 to 33.1% (PDUS). NPVs were higher (68.9-96.7% for GS synovial hypertrophy; 58.2-85.1% for PDUS), but the gain in relation to the prior risk of not progressing to arthritis was relatively small (0.8-12.5% for GS synovial hypertrophy; 2.9-13.9% for PDUS). Thus, NPVs were largely explained by prior risks of not developing inflammatory arthritis.
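The comparison of predictive values with prior risks can be sketched in Python as follows. The sensitivity and specificity are hypothetical placeholders, not estimates from the included studies; only the two prevalences (8.8% and 50%) are taken from the text, and the function names are ours.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float):
    """PPV and NPV for a test applied at a given outcome prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


def absolute_gain(ppv: float, npv: float, prevalence: float):
    """Gain over the prior risk of progressing and of not progressing."""
    return ppv - prevalence, npv - (1 - prevalence)


SENSITIVITY, SPECIFICITY = 0.60, 0.75  # hypothetical test characteristics

# Lowest and highest outcome frequencies reported across the included studies.
for prevalence in (0.088, 0.50):
    ppv, npv = predictive_values(SENSITIVITY, SPECIFICITY, prevalence)
    gain_pos, gain_neg = absolute_gain(ppv, npv, prevalence)
    print(
        f"prevalence {prevalence:.1%}: PPV {ppv:.1%} (gain {gain_pos:+.1%}), "
        f"NPV {npv:.1%} (gain {gain_neg:+.1%})"
    )
```

With the same test characteristics, a low-prevalence cohort yields a high NPV almost entirely because most patients do not progress anyway, which is the pattern described above.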

Discussion
The aim of this systematic literature review was to determine if US features of inflammation are predictive for inflammatory arthritis development and, if so, which US features are of additive value to other regularly used biomarkers. LoE for GS synovial hypertrophy as predictor for arthritis was limited and for PDUS moderate. LoE for the additive value of GS synovial hypertrophy and PDUS to other regularly used biomarkers was limited to moderate. Additionally, there were insufficient data on the value of US-detected tenosynovitis. Thus, there is a discrepancy between the frequent use of US in arthralgia patients to search for subclinical inflammation (which, if present, is generally considered a sign of imminent RA) in several rheumatology practices and the absence of strong scientific evidence on its prognostic value.
The limited/moderate LoE might be explained by the relatively low number of studies and the presence of different types of heterogeneity. Only six studies were included in this systematic literature review, of which two described the same cohort. The number of included patients per study was rather low, limiting the power to achieve statistical significance. Furthermore, heterogeneous arthralgia populations (seropositive arthralgia, seronegative arthralgia, ACPA+ patients with unspecific musculoskeletal (MSK) symptoms) were studied in different settings (primary and/or secondary care), with slightly differently defined outcomes ((persistent) (inflammatory) arthritis, RA), contributing to the wide range of outcome frequencies (8.8-50%).
Moreover, the US acquisition protocol, definitions of pathology, and scoring systems varied, although all followed internationally recognized recommendations and scoring systems [13][14][15][16][17][18][19][20]. Only very recently, EULAR/OMERACT published a standardized, consensus-based semi-quantitative scoring system for GS synovial hypertrophy and PDUS (separately and combined) [24,25], but this was not available when the studies included in this review were executed.
Other sources of heterogeneity were the selection of assessed joints, whether they were scanned from a volar or dorsal aspect, and the fact that different machines were used. It is known that machines vary widely in their sensitivity to pick up inflammation, especially with regard to Doppler modalities [26]. Three studies used a transducer with a maximum frequency of 12 or 13 MHz, while higher frequencies are recommended, especially for scanning small hand joints. Ideally, in order to arrive at a higher LoE, future studies should be performed in more homogeneous arthralgia populations (e.g., fulfilling the EULAR definition of arthralgia at risk for RA [27]), using the same scanning and scoring protocols (e.g., EULAR/OMERACT [24,25]).
Another issue is the definition of a "positive US". Different cut-offs were applied and none of the studies included information on US findings in healthy volunteers. It has been shown that a cut-off incorporating such findings increased the prognostic value of MRI in arthralgia patients [28]. US "inflammatory features" can also be detected in healthy volunteers, especially in certain joints and increasingly with age [29][30][31][32][33][34][35][36]. Whether incorporating age-dependent US reference values might increase the predictive value of US remains to be determined.

Pratt: a GSUS sum score ≥ 2; b GSUS sum score/6 joints (worst hand) ≥ 2; c GSUS number of joints ≥ 1: ≥ 3; d PDUS sum score ≥ 1; e PDUS number of joints ≥ 1: ≥ 2. Zufferey: a B-mode score > 8 (of total possible score of 66); b ≥ 2 joints (of total number of 22 joints) with grade ≥ 2 synovitis [18].

Likelihood ratio values between 0 and 1 decrease the probability of disease; values greater than 1 increase the probability of disease. An LR of 1 does not influence the probability. In general, an LR+ of 2 results in an approximate change of +15% in post-test probability, an LR+ of 5 in an approximate change of +30%, and an LR+ of 10 in an approximate change of +45%. An LR− of 0.5 results in an approximate change of −15% in post-test probability, an LR− of 0.2 in an approximate change of −30%, and an LR− of 0.1 in an approximate change of −45%. These estimations are accurate for pre-test probabilities between 10% and 90% [23].
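The effect of a likelihood ratio on the probability of disease can be computed exactly by converting the pre-test probability to odds, as in the Python sketch below. The function name and the pre-test probabilities chosen are ours and serve only as an illustration.

```python
def post_test_probability(pre_test: float, likelihood_ratio: float) -> float:
    """Post-test probability from pre-test probability and LR (via odds)."""
    pre_odds = pre_test / (1 - pre_test)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


# Illustrative pre-test probabilities (assumption); LR values as in the rule of thumb.
for pre_test in (0.10, 0.30, 0.50):
    for lr in (2, 5, 10, 0.5, 0.2, 0.1):
        change = post_test_probability(pre_test, lr) - pre_test
        print(f"pre-test {pre_test:.0%}, LR {lr}: change {change:+.1%}")
```

Running the loop shows how far the fixed percentage-point approximations deviate from the exact calculation at different pre-test probabilities, which is useful when the prior risk lies near the edges of the stated range.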