Ultrasound of metacarpophalangeal joints is a sensitive and reliable endpoint for drug therapies in rheumatoid arthritis: results of a randomized, two-center placebo-controlled study

Introduction We aimed to investigate the sensitivity and reliability of two-dimensional ultrasonographic endpoints at the metacarpophalageal joints (MCPJs) and their potential to provide an early and objective indication of a therapeutic response to treatment intervention in rheumatoid arthritis (RA). Methods A randomized, double-blind, parallel-group, two-center, placebo-controlled trial investigated the effect on ultrasonographic measures of synovitis of repeat dose oral prednisone, 15mg or 7.5mg, each compared to placebo, in consecutive two-week studies; there were 18 subjects in a 1:1 ratio and 27 subjects in a 2:1 ratio, respectively. All subjects met the 1987 American College of Rheumatology criteria for the diagnosis of RA, were ≥18 years-old with RA disease duration ≥6 months, and had a Disease Activity Score 28 based on C-reactive protein (DAS28(CRP)) ≥3.2. Subjects underwent high-frequency (gray-scale) and power Doppler ultrasonography at Days 1 (baseline), 2, 8 and 15 in the dorsal transverse and longitudinal planes of all 10 MCPJs to obtain summated scores of quantitative and semi-quantitative measures of synovial thickness as well as vascularity. The primary endpoint was the summated score of power Doppler area measured quantitatively in all 10 MCPJs in the transverse plane at Day 15. Clinical efficacy was assessed at the same time points by DAS28(CRP). Results All randomized subjects completed the trial. The comparison between daily 15 mg prednisone and placebo at Day 15 yielded a statistically significant treatment effect (effect size = 1.17, P = 0.013) in change from baseline in the primary endpoint, but borderline for prednisone 7.5 mg daily versus placebo (effect size = 0.61, P = 0.071). A significant treatment effect for DAS28(CRP) was only observed at Day 15 in the prednisone 15 mg group (effect size = 0.95, P = 0.032). However, significant treatment effects at all time points for a variety of ultrasound (US) endpoints were detected with both prednisone doses; the largest observed effect size = 2.33. Combining US endpoints with DAS28(CRP) improved the registration of significant treatment effects. The parallel scan inter-reader reliability of summated 10 MCPJ scores were good to excellent (ICC values >0.61) for the majority of US measures. Conclusions Ultrasonography of MCPJs is an early, reliable indicator of therapeutic response in RA with potential to reduce patient numbers and length of trials designed to give preliminary indications of efficacy. Trial Registration Clinicaltrials.gov identifier: NCT00746512


Introduction
The development of new therapeutics for rheumatoid arthritis (RA) involves clinical assessment of response by endpoints that include composite measures of disease activity, such as the Disease Activity Score in 28 Joints (DAS28) [1], a continuous measure, and American College of Rheumatology (ACR) categorical responses [2,3]. Many of the component measurements are subjective, imprecise and insensitive to change and their use often necessitates lengthy and costly clinical trials using large cohorts of patients. This results in greater exposure to experimental drugs in early testing, many of which will eventually fail to receive approval.
For early testing of novel therapeutics, we require a sensitive method to distinguish between treatment groups in cohort studies that permits small patient numbers and provides a reliable, early indicator of efficacy. Ideally, such measures would be quick, non-invasive, objective, predict longer-term response to repeated medication and give an early indication of disease modification. Due to ethical constraints of performing placebo controlled trials and the resultant trend towards comparator controlled trials, the requirement for sensitive endpoints is greater than ever.
Metacarpophalangeal joints (MCPJs) are invariably involved in RA [4] and so their evaluation is important. These superficial joints are amenable to assessment with ultrasound (US) utilizing frequencies that produce high resolution images. High-frequency ultrasonography (HFUS) and power Doppler ultrasonography (PDUS) are reproducible tools for determining synovitis and more sensitive than clinical scoring in determining disease activity [5,6]. The synovial vascular signal on PDUS closely correlates with the dynamic contrast enhanced magnetic resonance imaging (MRI) in RA MCPJs [7,8] and synovitis detected by US predicts erosive disease [9][10][11][12].
By using a known efficacious treatment for RA, our objectives were: 1. To investigate the sensitivity and reliability of twodimensional ultrasonographic endpoints (quantitative and semi-quantitative measures of synovial thickness and vascularity in MCPJs imaged in the dorsal longitudinal and transverse planes) and make comparisons between different endpoints. We have investigated the reliability of a summation of 10 MCPJs rather than the reliability on a joint by joint basis.
2. To determine the potential of two-dimensional ultrasonographic endpoints to provide an early and objective indication of a therapeutic response to treatment intervention in rheumatoid arthritis (RA).
3. To determine if there is a dose-response relationship between the two different relatively low, corticosteroid doses (15 mg and 7.5 mg) and ultrasonographic endpoints. 4. To compare the US endpoints with DAS28(CRP) (C-reactive protein) and to explore the potential of composite endpoints (DAS28 combined with US endpoints) to improve the registration of a significant treatment effect.

Patients
Protocol 088 (clinicaltrials.gov identifier: NCT00746512) was a randomized, double-blind, parallel-group, placebocontrolled trial conducted at two academic research centers in the UK. Two panels were planned for the study: In Panel A, subjects were randomized in a 1:1 ratio to oral prednisone 15 mg daily or matching placebo for 15 days. After a total of 18 subjects completed the study, an interim analysis was planned to determine if this smaller sample size could significantly (alpha = 0.03, 1-sided) discriminate prednisone 15 mg from placebo based on the primary endpoint. If so, then enrollment in the prednisone 15 mg group would cease, and 27 additional subjects would be randomized (2:1 ratio) in Panel B to prednisone 7.5 mg or matching placebo for 15 days.
Two centers were chosen for feasibility of recruitment to the study. One of our general aims is to investigate the applicability of US endpoints in multi-center clinical trials.
Men and nonpregnant women ≥18 years old with RA for ≥ six months duration meeting the 1987 American College of Rheumatology criteria for the diagnosis of RA were eligible [13]. Subjects were required to have at least moderate disease activity (DAS28(CRP) ≥3.2) and moderate dorsal transverse synovial vascularity in two MCPJs (score ≥2) or severe in one MCPJ (score = 4) as measured on a semi-quantitative 0 to 4 scale.
Non-steroidal anti-inflammatory drugs (NSAIDs) at stable doses for ≥ four weeks were permitted, as were disease modifying anti-rheumatic drugs (DMARDs) at stable doses for ≥ six weeks, topical or inhaled glucocorticoids at stable doses for ≥ two weeks, and opiates at stable doses ≥ two weeks. Acetaminophen (paracetamol) was allowed for breakthrough pain, but NSAIDs were not to be taken on an as-needed basis.
Pertinent exclusion criteria included intra-articular glucocorticoid injections to MCPJs within three months or to non-MCP Js within six weeks of baseline; oral glucocorticoid use within four weeks; and current biological therapies.
Tolerability was assessed by clinical and laboratory examination and adverse event (AE) reporting during the study. After baseline measures subjects were randomized (by a sponsor statistician using a computerized Clinical Allocation Schedule System (CASS) with blocking factors to ensure blinding based on a multiple of the number of treatment groups and subjects) and then received their first dose of study medication.
All subjects gave informed written consent to participate. The study was conducted in accordance with the principles of Good Clinical Practice and approved by the institutional review board for human research.

Ultrasonography
Assessment of reliability has been complicated by a diversity of nomenclature employed by different investigators. Terms for assessment of reliability are redefined below to avoid confusion. The ultrasonographer is usually also a reader of the anonymized images.
Within scan intra-reader: one patient, one ultrasonographer acquires one scan set, one reader reads the scan set twice; each reading is separated by a fixed time period (previously called intra-reader [14] and intraobserver [10,15]).
Parallel scan intra-reader: one patient, one ultrasonographer acquires two scan sets sequentially, one reader independently reads both scan sets (previously called intraobserver [20]).
Imaging was performed at two centers (Kennedy Institute of Rheumatology (KIR) and St Bartholomew's and the London National Health Services Trust (B&L)) by two ultrasonographers (MS and SK, each with more than two years experience), both blinded to the subjects' group allocation. They spent approximately 16 hours together before the study to gain consensus on image acquisition and analysis. To determine reliability (within scan inter-reader, parallel scan intra-reader and parallel scan inter-reader) three scans were sequentially acquired at Day 1 (baseline) and on Day 15 according to Table 1.
Using a GE Logiq9 ultrasound machine with a twodimensional M12L transducer at each center, subjects underwent HFUS and PDUS scanning over the dorsum of all 10 MCPJs at Days 1, 2, 8 and 15 in the longitudinal and transverse (over the triangular structure -method previously described [27]) planes. Settings were identical on both GE Logiq9 ultrasound machines: HFUS (grayscale) -Frequency 14 MHz; PD -Frequency 7.5 MHz, Gain 41, PRF 1.4 kHz, Wall Filter 127 Hz. With a view to standardization of data acquisition, the hands were maintained in a position of rest by a splint (identical at both sites). The time of day of scanning at each visit was within 1 hour of the time of the baseline visit. Care was taken when scanning to avoid undue pressure with the probe in case this altered blood flow in the joint. This was achieved by maintaining a distance of at least 1 mm of gel between the probe and the subject as visualized on the US monitor.
Stored clips and images were anonymized before reading. Each PDUS scan consisted of a three second movie clip. PDUS measures were made on the image frame at the peak of the PDUS signal and synovial area measures were made from the first technically qualified image on HFUS imaging.
The Synovial Thickness Area (STA), a quantitative measure, is a count of the number of pixels within a defined region of interest (ROI) in a standardized twodimensional image of the joint. For the longitudinal STA (Long STA) the ROI should envelop the synovium over the phalangeal base, triangular structure, metacarpal head and metacarpal notch to the joint capsule superiorly. For the transverse STA (Trans STA) the ROI should envelop the MCPJ synovium from the lower border of the triangular structure (if bone, this is indicated by a continuous hyperechoic line or if cartilage by a homogenous anechoic line above bone) to the joint capsule superiorly (Figure 1). The transverse and longitudinal STA from each of the 10 MCPJs were summated in each respective plane to create the 10MCP Trans STA and 10MCP Long STA.
Synovial thickness (ST) was graded semi-quantitatively in each MCP joint against a standardized image set on an ordinal scale ranging from 0 to 4: 0, no synovial thickening; 1, minimal; 2; mild; 3, moderate; 4, severe ( Figure 1). The longitudinal and the transverse ROIs that were compared with the representative images were the same as the respective STA ROIs. The saved gray-scale image was compared with the library and MS and SK decided which representative image was the closest fit with regard to area of ST and allocated a score. The scores from each of The Power Doppler Area (PDA), a quantitative measure, is a count of the number of pixels with PDUS signal, uncorrected for pixel intensity, within a defined ROI in a standardized two-dimensional image of the joint. The ROIs for longitudinal and transverse PDA are the same as the corresponding ROI for STA and therefore extraarticular digital vessels are excluded. If present, reflection artifacts from digital vessels are also excluded if they enter the ROI (Figure 1). The transverse and longitudinal PDAs from each of the 10 MCPJs were summated in each respective plane to create the 10MCP Trans PDA and 10MCP Long PDA.
PDUS was also graded in each MCPJ using a semiquantitative 0-to-4 vascularity scale: 0, no PD signal; 1, minimal; 2; mild; 3, moderate; 4, severe. As for the PDA the longitudinal and the transverse ROIs were the same as the respective STA ROIs. Images were graded against a library of representative images (Figure 1), that is, for each selected image MS and SK visually estimated the amount of colored pixels within the joint capsule, compared this with the library, decided which representative image was the closest fit and allocated a score. The scores from each of the 10 MCPJs were summated to create a Vascularity Index (VASCi: minimum score of 0 and a maximum of 40) for each plane, the 10MCP Trans VASCi and 10MCP Long VASCi. The 10MCP Trans PDA was the primary endpoint; the other US endpoints were secondary. The transverse view was chosen as this had previously demonstrated its utility in differentiating two groups in a randomized placebo controlled trial [12].
Quantitative vascularity and ST measurements were analyzed using the free downloadable software program ImageJ version 1.41 with an in-house plugin written by BD which enabled a rapid review of each PD clip to find the frame that displayed the most activity.
Several published four point scales for power Doppler (PD) have no PD signal as the lowest grade (normal), presence of a single vessel (mild) as the next grade and then less than 50% (moderate) and greater than 50% (severe) PD signal to gray-scale signal within the ROI to determine the next two grades [6,21,26]. The majority of published four-point gray-scale semi-quantitative scales of synovial thickening/synovitis are explained simply as subjective grading with four points (normal, mild, moderate and severe) [6,21,23]. Those that are described in detail examine the extent at which synovitis breaches boundaries to determine the grade of synovitis [19,26]. Ordinal grading scales typically infer linearity and equidistance between grading points, whereas the above mentioned four-point vascularity scales lack grading points at low levels of Doppler signal, potentially underestimating change (either up or down) in response to therapy. Our rationale for increasing the semi-quantitative scales for scoring by one point was to provide semi-quantitative ultrasonographic outcome measures that are potentially more sensitive to change. We developed a five point scale for synovial vascularity. The basis for this is that within the above mentioned four-point vascularity scales there is potentially a leap between mild and moderate. In our experience there are often images that the four point scales [6,21,26] would class as moderate which we would still consider as mild, for example, two vessels and up to three small areas of confluence. We have made provision for this. On our scale these two examples would score two and, therefore, one vessel scores one on our scale, minimal. This additional grade may serve to improve the sensitivity. Therefore, another objective of this study was to investigate this five-point scale for vascularity and also the novel five-point scale for ST. In support of the library of images approach to semi-quantitative scoring of synovitis, a recent study that utilized a US atlas has reported excellent parallel scan inter-reader reliability (intra-class correlation coefficients (ICC) values: gray-scale 0.95 and PD 0.97) [28].

Clinical efficacy assessments
Clinical efficacy was assessed by the DAS28(CRP), which includes the number of swollen and tender joints (28joint count), a patient's global assessment of arthritis index (visual analogue scale) and CRP. Assessments were performed by a single rheumatology research nurse in each center, independent of the US examinations. Each study nurse attended a DAS28 standardization training course within the preceding year of the study start.

Statistical analysis
The primary hypothesis for both panels was that prednisone (15 mg or 7.5 mg) would have a greater change from baseline in 10MCP Trans PDA (the primary endpoint) after 15 days of treatment. The analysis was performed using an analysis of covariance model with panel and treatment nested within panel as factors, and baseline value included as a covariate. Only observed data were analyzed; missing data were not imputed. The analyses were carried out for change from baseline at each of days 15, 8, and 2. Interpretation of P-value testing for each endpoint was made in a step-down fashion, in that order, at α = 0.05 (1-sided) for the primary endpoint. For Panel A Days 1 and 15, the first scan set acquired and read by the ultrasonographer-reader associated with the clinical site was used for analysis of treatment effects and for correlation with other endpoints; there was one scan set per visit for all other study time points.
For the interim analysis following Panel A, if the true underlying effect size for the primary endpoint was 1.0, the overall power, accounting for the interim and a potential final analysis if the study continued as originally planned, was approximately 88% for 18 subjects per group and approximately 83% for 15 subjects per group. The probability of stopping at the interim analysis was approximately 63%. These computations employ the Hwang, Shih, deCani gamma = 1 stopping criteria which yields α = 0.03, 1-sided at the interim and, if applicable, final analyses. This controls the overall alpha level at 0.05, 1-sided. If the study continued after the interim analysis, a sample size of 15 subjects per group had 80% power to detect a statistically significant (α = 0.05, 1-tailed) difference between prednisone and placebo assuming an effect size of 0.93.
Significance for the effect of the secondary imaging endpoints and DAS28(CRP) was not error-protected from chance significance associated with multiple comparisons. The rationale for this choice is to minimize the chances of false negatives because of the exploratory nature of these secondary endpoint analyses. Thus, they are viewed as hypothesis generating rather than conclusive.
The repeated measures on Days 1 and 15 of Panel A were used to assess reproducibility (Table 1) using intra-class correlation coefficients for each endpoint. Pearson correlation coefficients were computed among pairs of endpoints, including DAS components, with US endpoints, to assess association among them.
For the analysis of creating a composite PDUS and DAS28 endpoint, precision was measured by effect size (average effect of the 15 mg and 7.5 mg prednisone effects minus placebo divided by pooled standard deviation) using a model with panel and treatment nested within panel as factors. The method for building composites was the O'Brien global statistic which is the average of the standard z-scores across the endpoints.
All tests were performed one-sided at level 0.05.

Patients
Baseline subject and disease characteristics are shown in Table 2. Due to an error in the randomization code, there was uneven subject allocation in Panel A. There were no significant differences between baseline characteristics in subjects enrolled in Panels A and B or between those in each treatment assignment within Panel A and B. All treatments were generally well-tolerated. A total of 19 subjects reported 31 AEs (including 6 AEs with onset at prestudy/pre-treatment). There was one serious AE of breast carcinoma which was not considered by the investigator to be drug-related. None of the subjects enrolled discontinued due to an AE.

Endpoint responsiveness
The   In this instance, the 10MCP Trans PDA for Panel A used data from all three baseline scans. There were no statistically significant differences, P >0.05, between groups based on nonparametric Wilcoxon and t-tests. a Positive rheumatoid factor was defined as rheumatoid factor >= 12 KU/L, the reference range for Quest assay. CRP, C-reactive protein; DAS28, Disease Activity Score in 28 Joints; MCP, metacarpophalangeal joint; PDA; Power Doppler Area; SD, standard deviation.  None of the imaging endpoints showed a statistically significant difference between the treatment effects of the two prednisone doses. However, there was concordance in the rank ordering of effect sizes for US endpoints in keeping with the prednisone dose.

Reproducibility of summated 10MCP score US endpoints
The within-scan inter-reader, parallel scan intra-reader and parallel scan inter-reader reliability were good to excellent for the majority of US measures. Overall mean ICC values for within-scan inter-reader, parallel scan intra-reader, and parallel scan inter-reader reliability were 0.77, 0.83 and 0.61, respectively. Overall mean ICC value for all quantitative measures of synovitis was greater than for all semi-quantitative measures, 0.77 and 0.70, respectively (P = 0.77). The overall mean ICC value for all longitudinal measures of synovitis was greater than for all transverse measures, 0.80 and 0.68, respectively (P = 0.16). The overall mean ICC value for all power Doppler measures of synovitis was greater than for all gray-scale measures, 0.84 and 0.63, respectively (P = 0.002). P-values were computed assuming independence of the ICC values via Wilcoxon Rank Sum test, and are two-sided.
The endpoints with the best agreement at baseline or on treatment were the VASCi (Long and Trans), both of which are semi-quantitative measures of vascularity but assessed in different planes (Table 3).

Panel A correlations between US endpoints and DAS28 (CRP)
Correlations assessed at baseline between US endpoints and DAS28(CRP) were moderate and ranged between 0.52 and 0.68; all were statistically significant (P < 0.05). At Days 15, 8 and 2, 96% of the correlations in the placebo group were between 0.5 and 0.9 (actual range 0.23 to 0.9) and 92% of the correlations were between 0.4 and 0.8 (actual range 0.11 to 0.80) in the prednisone group.

Exploratory composite endpoint responsiveness
Each composite endpoint is a simple sum of the standardized score corresponding to the DAS28(CRP) and two US endpoints, each summed across 10 MCPJs. Effect sizes at Days 15, 8 and 2 for composite endpoints constructed using the DAS28(CRP), Trans PDA/VASCi, and Long STA/STi in Panel A are shown in Table 4. These were selected as they generally had the largest effect sizes at the three post-treatment time points. All four composite endpoints demonstrated statistical significance after only a single day of dosing, that is, at Day 2.
The effect of prednisone 7.5 mg, Panel B, on the relatively easy-to-score pre-defined composite endpoint taken from the analysis of Panel A, Z-score(DAS28 + Trans VASCi + Long STi), was significant at Days 15 and Day 8 (effect sizes 1.84 and 1.80, respectively).

Discussion
We have demonstrated that a wide range of HFUS measures of ST and PDUS measures of synovial vascularity at the MCPJs are reproducible and capable of detecting treatment effects of oral prednisone (15 mg and 7.5mg daily) after a week, and two US measures after only one day, in small panels of subjects (n = 18 and n = 27, respectively) with moderate to severely active RA. DAS28 (CRP) was only able to detect a significant treatment effect after two weeks in the 15 mg cohort. US may, therefore, be a leading indicator of therapy response occurring before a clinical response. At present, more than 50% of drugs tested fail at phase III and the expense of the traditional drug development pathway has become prohibitive for numerous novel compounds developed to selectively inhibit a range of potential therapeutic targets that have been identified for RA. Our study has extensively investigated the sensitivity and reliability of a diverse range of two-dimensional ultrasonographic endpoints at the MCPJ and their potential as tools to provide an early and objective indication of a therapeutic response to treatment intervention in RA. We have confirmed that ultrasonography of MCPJ is an early, reliable indicator of therapeutic response in RA and it thus has the potential to reduce patient numbers required as well as the duration of clinical trials designed to give a preliminary indication of efficacy. Such an approach to early drug development in RA might increase the chances of success in later phase studies designed to meet the regulatory endpoints that are required to achieve approval.
In the present study, correlations between the majority of different US endpoints and DAS28(CRP) while on prednisone treatment were between 0.4 and 0.8, suggesting that they measure somewhat different constructs. Combining US endpoints with DAS28(CRP) increased effect sizes at all time points and identified treatment effects earlier. Composite endpoints increased the endpoint sensitivity for 15 mg in Panel A. The DAS28(CRP) had an effect size of about 1.0, which would take 13 subjects per group to identify a treatment difference (alpha 0.1; 80% power). In combination with a US endpoint with combined effect size of approximately 1.5, the sample number drops to six per group. Some combinations of US endpoints with effect sizes of approximately 2.0 would require four subjects per group. Likewise, single dose effects of 15 mg prednisone were identifiable with some combinations of endpoints. These findings strongly suggest a potential value in employing such composite endpoints in future prospective small studies designed to establish an early indication of efficacy. Composite endpoints were selected from Panel A on how well they performed. They were tested in Panel B in a predefined way but in a limited capacity. These composite endpoints need to be tested in future studies to confirm their utility.
Both 15 mg and 7.5 mg prednisone represent relatively low corticosteroid doses and it would be notable if an endpoint could differentiate their effect. Overall, there was a trend towards a dose-response. Greater   numbers of subjects may have discriminated the two doses. Other factors that may have decreased the study's ability to differentiate the two doses include the fact that there were two centers, the scanning rooms of which, for example, may have been at different temperatures, and there were two ultrasonographers, the first scans of each in their respective centers were used to determine treatment effect. For Panel B the 10MCP Trans PDA demonstrated a significant treatment effect earlier than in Panel A. This may be because there were more subjects in Panel B who received active treatment, albeit at a lower dose. To support this, at Day 8 more US endpoints registered a significant treatment effect for Panel B than Panel A.
For Panel A, seven out of eight US endpoints demonstrated a consistent time-response to 15mg of prednisone. Within Panel B more US endpoints registered a significant effect size at Day 8 in comparison with Day 15 perhaps due to waning of therapeutic response to low dose corticosteroid in some subjects. The observed transient response of the US endpoints to 7.5 mg of prednisone was mirrored in the effect sizes of the DAS28(CRP) even though this latter endpoint did not show significance at any time point. We postulate that for some subjects in Panel B, 7.5mg of prednisone may be just below the threshold dose for a sustained anti-inflammatory effect. The biological response to prednisone at low doses (≤7.5 mg/day prednisone or equivalent), is not necessarily predictable in inducing and sustaining an anti-inflammatory effect in RA [29]. If we had used larger doses of prednisone in the study, for example 40mg, we would have undoubtedly seen more consistent timeresponses but this would have weakened the impact of the study as it would not have permitted a demonstration of the sensitivity of US to detect change.
The Long STA endpoint performed especially well in the current study. Our previous investigations of HFUS grayscale ST have shown inferiority to power Doppler vascularity in detecting a treatment effect with respect to the kinetics and the extent of change [27,30]. However, those studies measured synovial thickening semi-quantitatively in the transverse plane only. Semi-quantitative indices may constrain the detection of change in joints if synovial thickening greatly exceeds the largest score by delivering static scores when genuine reduction in synovial thickening can be detected quantitatively. The greater area afforded by the longitudinal versus the transverse view may have also benefited the registration of a treatment effect by the Long STA endpoint. The data in the current study support these theories: semi-quantitative measures of synovial thickening had smaller effect sizes than quantitative measures (the only exception was Day 2, panel B); transverse measures of synovial thickening had smaller effect sizes than longitudinal measures (the only exception was Day 2, Panel A; Figure 2). The treatment effect was less at Day 2 and, therefore, these factors would have had less influence at this early time point.
Most US studies have investigated reliability on a joint by joint basis. Few have assessed reliability of a summation of scores for a selected group of joints. Naredo et al. assessed within scan intra-reader reliability with a resultant excellent ICC value of 0.99 [10] for summated 4-point semi-quantitative PDUS imaging of 28 joints, called the 'overall US joint index for power Doppler signal'. Backhaus et al. [31] developed a composite US score called the 'German US7 score'. They measured HFUS synovitis and PD synovitis using 4-point semi-quantitative scales in seven joints and the within scan inter-reader reliability kappa value was 0.6. Arguably the most robust measure of reliability is the 'parallel scan inter-reader' (included in our study) because it is a comparison between two ultrasonographer-readers scanning the same patient. The images are read independently, as might be the case in multi-site clinical trials using the same model of US machine and settings. As expected the overall reproducibility for parallel scan inter-reader reliability was lower than within scan inter-reader reliability; the difference between these two methods most likely representing the loss of concordance due to image acquisition. A similar observation was reported by Kamishima et al. [32]. Despite this shortfall, in the current study good agreement was observed for the overall parallel scan inter-reader reliability. The overall parallel scan intra-reader reliability was strongest demonstrating the potential advantage of one ultrasonographer acquiring and reading the scans at a single site.
Quantitative ultrasonographic measures of synovitis demonstrated better overall reliability than semi-quantitative measures although the difference was not statistically significant. Therefore, within future studies there may still be a place for more time-consuming measures of synovitis, by computationally quantifying pixel counts, but quicker semi-quantitative scales may be an acceptable substitute. We observed that power Doppler measures of synovitis were significantly more reproducible than gray-scale measures of synovitis and we advocate that future US studies include power Doppler vascularity endpoints to deliver optimum reliability.
The dimensions of the transducers available for use in this study may have been a limitation resulting in weak inter-reader reliability (within-scan or parallel scan) for the 10MCP Trans STi and the 10MCP Trans STA. Because of the broad width of the transducer relative to the deepest point of the triangular structure (which is a narrow precise location), more than one hyperechoic line, representing bone, is often observed on the saved gray-scale image. Therefore, MS and SK may have chosen different ROIs depending on which line was selected to represent the lower border of the triangular structure, even though, from the beginning of the study, there was a consensus to use the lowest hyperechoic line.
Another limitation of our study is that the two prednisone doses were trialed in series rather than in parallel and, therefore, although comparisons can be made between treatment groups, firm conclusions are hampered. This is especially relevant when attempting to comment on the dose-response of the US endpoints to prednisone.
Due to the time constraints of scanning we restricted our US evaluation to the dorsum of the MCPJs. It may have been valuable to have assessed endpoints derived from imaging over the palmar surface also.

Conclusions
Our study confirms that ultrasonographic imaging of MCPJs could be used as an early and reliable indicator of a therapeutic response to a new treatment intervention in RA early phase clinical trials with small patient cohorts over a two-week test period and decrease the time-to-decision for progressing clinical development. By addressing the issues surrounding the reliability of US to measure synovitis objectively this study brings us closer to approving this tool as a recognized endpoint for confirming treatment effect in RA clinical trials. The semi-quantitative US endpoints demonstrated in this study, whether used alone or together with clinical measures as composite endpoints, could be used in centers not possessing quantitative analysis tools. The study also illustrates the potential utility of US to stratify patient selection by detecting those with potentially reversible baseline joint inflammation, an important consideration given concerns about bias introduced in trials by recruitment of patients with equivocal clinical swelling. Moreover, composite endpoints have the potential to further reduce patient numbers and study duration in early phase trials. Authors' contributions MWS was involved in the design of the study, acquisition, analysis and interpretation of data, drafting and final approval of the manuscript. SK was involved in the design of the study, acquisition, analysis and interpretation of data, drafting and final approval of the manuscript. CRB was involved in the conception and design of the study, analysis and interpretation of data and drafting and final approval of the manuscript. M-PM was involved in statistical analysis, interpretation of the data and final approval of the manuscript. JAB was involved in the design of the study, analysis and interpretation of the data and final approval of the manuscript. BJD was involved in the administrative support, analysis and interpretation of the data and drafting and final approval of the manuscript. ASC was involved in the administrative support, analysis and interpretation of the data and drafting and final approval of the manuscript. CEC was involved in the administrative support, analysis and interpretation of the data and drafting and final approval of the manuscript. SSS was involved in the statistical analysis, analysis and interpretation of the data and drafting and final approval of the manuscript. CMcC was involved in the administrative support, data acquisition and final approval of the manuscript. AF was involved in the administrative support, data acquisition and final approval of the manuscript. WD was involved in the interpretation and analysis of the data and final approval of the manuscript. CP was involved in the conception and design of the study, interpretation and analysis of the data, drafting and final approval of the manuscript. PCT was involved in the conception and design of the study, interpretation and analysis of the data, drafting and final approval of the manuscript.